Enterprise-grade software integrations are a critical component of modern business software, and they routinely handle large volumes of data.
(An “integration”, in this post, means linking different software applications to act as a coordinated whole.)
Let's assume the integration is built to high quality standards. (That's a topic big enough for several other posts.) The vast majority of the time the integration tries to move data, it will succeed...but not every time.
Consider this scenario. In a day, you sync thousands of purchase orders from System A to System B. Remarkably, only two or three fail to reach their destination.
But, those uncommon failures matter!
Your user doesn’t care that only one or two records didn’t make it from A to B, because they needed 100%. Their business and their customer or vendor relationships depend on it. That failure might be a million-dollar order. It might be for a top-tier customer. It might be an urgent notification that absolutely must be sent.
Missing even small numbers of records can impact your users’ books. It can impact their customer experience. It can impact cash flow. It could cause data privacy or personalization mistakes.
Therefore, you need to think about how to build integrations that do the job right 99.99% of the time. You also need to think about how to surface that small minority of times it fails to do that job.
Those failures will happen, and they need to be addressed quickly and appropriately. Doing so is what it means to deliver “supportable” integrations. In this post, we’ll define what it means to be a “supportable” integration and give some guidance for how to design and build one.
Why do integrations fail?
It’s helpful to understand the common reasons integrations fail.
Integrations fail to deliver data from one system to another for many reasons. Some of them cause painful, widespread failures. Some of them cause one failure among thousands or millions of successes.
There are virtually unlimited reasons an integration can fail. But, there are some common categories. These are important to understand, because they are what you design for. They are what you consider when creating an integration that is supportable.
End User Mistakes or Unexpected Behaviors
When you design an integration, start with a document that defines the cross-product use cases. In other words, what is the user trying to achieve that traverses two software solutions? Agile user story format (or something similar) is a good way to define these use cases.
You should also consider different ways the user will do the task. This will help you predict how users will interact with an integration, so you can test it like they will use it.
But, you can’t predict everything that a user will do!
One of the most common reasons an integration fails is because a user did something that you didn’t expect. Maybe they installed some third party plugin that changed a system’s behavior. Maybe they clicked the button that triggers the integration 20 times quickly. Maybe they dropped a Microsoft Word file into an integration that expected XML.
Users do all sorts of things you wouldn’t expect! But, it’s not realistic to predict and build for infinite possibilities.
Unexpected or Unsupported Data
Integrations get data from system A in a certain format, then put it into system B in another format. A good design process should account for the default formats of each system. It should also consider the possibilities for how those formats can change.
How flexible data is within a given software system can vary a lot. Some systems are very locked down. Others are very open ended. The integration should account for this relative data flexibility within the integrated systems.
But, again, you can’t account for every possible change you’ll see in the wild.
It’s pretty common that a user takes some action that changes the data in a system in a way that breaks the integration. Usually that means running into an error. It can also mean atypical behavior, even when no error is triggered.
This can also show up when one of the integrated systems rolls out changes to an API the integration hasn’t anticipated. New bugs emerge. Vendors usually shouldn’t roll out breaking API changes to the public, but it does happen.
Integration Design Deficiencies
Integration is hard. The people who build integrations are fallible. Sometimes an integration fails because it just isn’t built the right way.
These problems are usually the second thing to look for if your issue doesn’t appear to be data or end user related. They are also a little harder to spot. If you build integrations using a framework, it’ll be easier to identify issues. If you build integrations from scratch, it's less likely there will be a predictable way to find them.
People who build integrations will make occasional mistakes. This is true with code or "no code". It happens, especially with complex enterprise integrations. But, usually the root cause of such mistakes, especially if they are recurrent, is process.
We’ve written many articles on building an effective process for integration. The right process can mitigate these problems and defend against mistakes.
Endpoint System Failures
Sometimes the integrated endpoint systems themselves malfunction. These issues can cause the integration to fail.
A classic example is when one system has an API outage. The integration may be running fine, but all of a sudden: the dreaded “500 Internal Server Error” response.
There are many reasons a software product can fail to do its job, 500 errors being one. The ways those impact the integration may be obvious. Or, they may cause an integration to do unexpected things. When that happens it makes the root cause harder to find.
When integrating to well-adopted software products, you can assume this is rare. But, results may vary if you are integrating to:
- Custom or homegrown software
- Early stage software products
- On-premise enterprise software
Infrastructure Failures
Your integration code runs somewhere. This is true whether it’s written “in house” or runs on a commercially available iPaaS. Servers fail. Load balancers fail. DevOps people make mistakes.
That means infrastructure failures can cause an integration to run improperly. These failures are the least common of the causes listed here, and managing servers and infrastructure is outside the scope of this post.
But, know that they can happen and can be the cause of a failed integration. When they do happen, the problem is usually widespread. It requires help from your infrastructure team, DevOps team, or cloud provider. It also tends to be harder to clean up after.
What does it mean to be supportable?
The integrations fail only rarely.
There isn’t a failure count universally considered reasonable, but the number should be small. Integrations that regularly fail are expensive, because they need a lot of work. They also create poor customer experiences.
Strive for zero failures. Accept that some will occur. But, make sure they are exceptions.
You are notified of failures quickly.
It’s acceptable, and unavoidable, that integrations sometimes fail. But because failures are rare, they may be hard to notice. You need to make sure the integration proactively notifies you of those failures, in real time.
You can find the relevant clues.
It’s one thing to report a failure, but if you don’t have enough information to know the cause, what’s the point? You can’t go on a log diving excursion every single time an alert goes off. It’ll be too hard to address problems in an acceptable time frame. That means in addition to notifying, an integration must direct you to clues about why it failed. These clues typically include relevant logs, event streams, and other details.
You can act upon those clues.
What good is knowing that an integration failed and even why it failed if you can’t do anything about it? Your integration needs to be able to retry, re-run, or clean up after a failure. As much as possible you should make this automatic or easy to do.
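As an illustration, here is a minimal retry-with-backoff sketch. The `sync_record` callable is a hypothetical stand-in for whatever actually moves one record; the attempt counts and delays are illustrative, not prescriptive:

```python
import random
import time

def sync_with_retry(sync_record, record, max_attempts=5):
    """Retry a failing sync with exponential backoff and jitter.

    `sync_record` is a hypothetical callable that raises on failure.
    After the final attempt, re-raise so the failure gets surfaced
    and queued for cleanup instead of vanishing silently.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return sync_record(record)
        except Exception:
            if attempt == max_attempts:
                raise  # surface the failure; don't swallow it
            # Back off 1s, 2s, 4s, ... plus jitter to avoid thundering herds
            time.sleep(2 ** (attempt - 1) + random.random())
```

The key design choice is the final `raise`: automatic retries handle transient problems, but a record that still fails after the last attempt must become a visible, actionable failure.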
The integrations are automatic.
You can’t hire people to watch integrations run all day, every day. Even if you could, most move so much data that a person couldn’t keep up. That means in order for an integration to be supportable, it must be automatic. This is especially true when it comes to handling unexpected changes.
Some of the common things an integration must be automatically resilient to include:
- Unexpected API outages
- Surges in data that are outside of the normal range
- Nefarious requests from bots or hackers
- Cloud infrastructure problems (e.g. an AWS outage)
The more capable an integration is of handling these kinds of things, the more supportable it is.
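One common resilience pattern for unexpected API outages is a circuit breaker: after repeated errors, stop hammering the endpoint and probe again later. A minimal sketch (thresholds and cool-down are illustrative):

```python
import time

class CircuitBreaker:
    """Stop calling a failing endpoint after repeated errors,
    then probe again after a cool-down period."""

    def __init__(self, failure_threshold=5, cooldown_seconds=60):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: endpoint presumed down")
            self.opened_at = None  # cool-down elapsed; try the endpoint again
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
                self.failures = 0
            raise
        self.failures = 0  # any success resets the count
        return result
```

Failing fast while the circuit is open protects both sides: the struggling endpoint gets room to recover, and your integration produces one clear "endpoint down" signal instead of thousands of identical errors.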
Questions You Should Be Able to Answer
It’s one thing to understand what makes an integration supportable. It’s another to apply that understanding to actually building a supportable integration.
We’ll walk through some questions that can show if your integration is supportable. Then we’ll talk about best practices for building a supportable integration. If you have succinct answers to the following questions, you're on the right track.
How do you know an integration failed?
It almost goes without saying, but you do need to be notified when your integration fails. If it fails silently, high business impact mistakes get discovered too late.
Some problems also grow the longer they remain unaddressed. You need to know that the integration failed as quickly as possible, so a small problem doesn't become a big one.
What do you do about it?
The integration should also help you understand what to do about the failure. Even suggestions are better than nothing.
When integrations fail, they tend to do so with technical, esoteric error messages that are hard for non-engineers to make sense of.
You should be able to identify which issues go straight to the engineers and which ones someone non-technical can address. You should also be able to point either group toward how to solve the problem.
Who does something about it?
It’s also important that the integration notifies people in such a way that it’s clear whose job it is to act. Most enterprise software implementations have many teams with layers of overlapping responsibilities. When the alarm goes off, whose job is it to respond?
How does this responsibility relate to SLAs or other end customer obligations?
Product or tech-enabled services providers that deliver integrations often have SLAs or contractual obligations for support. You must consider how the integration you build helps you meet those obligations. If it doesn’t, an impactful failure will lead to a painful conversation.
What happens if the issue doesn’t get noticed?
Sometimes you don’t catch a failure. Maybe something particularly uncommon happened. Maybe you just don’t have your operational ducks in a row. It happens.
You want to think about what the integration does in that case. Are you able to build in any safety measures that:
- Escalate notifications?
- Kill the integration to prevent a problem from getting bigger?
- Heal on their own?
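A sketch of the first two safety measures, combined: escalate an alert if failures keep coming, and halt the integration before the damage compounds. The notifier callables and thresholds here are hypothetical stand-ins for your own alerting channels:

```python
class FailureGuard:
    """Escalate unacknowledged failures and halt the integration
    before a small problem becomes a big one.

    `notify_oncall` and `notify_manager` are hypothetical callables
    representing two tiers of alerting channels.
    """

    ESCALATE_AFTER = 5  # illustrative: escalate on the 5th straight failure

    def __init__(self, notify_oncall, notify_manager, kill_switch_threshold=10):
        self.notify_oncall = notify_oncall
        self.notify_manager = notify_manager
        self.kill_switch_threshold = kill_switch_threshold
        self.consecutive_failures = 0
        self.halted = False

    def record_failure(self, detail):
        self.consecutive_failures += 1
        self.notify_oncall(detail)
        if self.consecutive_failures == self.ESCALATE_AFTER:
            self.notify_manager(detail)  # escalate: nobody has fixed it yet
        if self.consecutive_failures >= self.kill_switch_threshold:
            self.halted = True  # kill switch: stop syncing, don't compound damage

    def record_success(self):
        self.consecutive_failures = 0  # healthy again; reset the counter
```

The "heal on their own" measure is the retry logic discussed earlier; this guard is the backstop for when retries aren't enough.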
Best Practices for Supportable Integrations
With all this said, the following are some best practices you can follow. They will help you build more supportable integrations. They also embody all the advice shared so far in the post.
Build integrations in consistent ways.
If you’re only taking on one or two integration projects, you can deliver them how you want. You don’t need to put a ton of thought into them beyond “getting them done”.
If you sell a software product or a tech-enabled service, this isn’t your reality. Most or all of your customers likely need one, two, or more integrations. There is probably overlap in the systems all those customers need to integrate with. But, there will be variations within that overlap.
The only way to deliver all those integrations at high quality is to use a standard approach. Consistency is key! This means things like:
- Implementing a single, global integration framework or architecture
- Sharing code wherever possible
- Solving similar problems across integrations in similar ways
- Reducing variation and operationalizing processes
- Centralizing key functions like authentication, monitoring, and API call management
This will all feel like overkill for the first few integrations. If those few integrations are where it ends, it is overkill.
But, if they are the first few of an eventual many, you should think about scale in the beginning. It’ll pay dividends as your integration portfolio grows.
Define a standard taxonomy for “what happened”.
A big part of that consistency should include consistent history. Integrations must track “what happened” using logging and event streams. They should all do it the same way.
Define a single taxonomy for all the happenings across all integrations. This usually includes important and universal information like:
- What triggered the integration to start and when did it happen?
- What is the business reference identifier for the synced data?
- What thing completed a job (made an API request, transformed data, etc.)?
- What end customer is the integration responsible for?
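One lightweight way to enforce a shared taxonomy is a single event record that every integration emits. The field names below are illustrative, not a standard:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class IntegrationEvent:
    """One shared shape for 'what happened' across all integrations."""
    trigger: str      # what started the run (webhook, schedule, manual)
    occurred_at: str  # ISO-8601 timestamp of the event
    business_ref: str # business reference, e.g. the purchase order number
    component: str    # which step acted (API request, data transform, etc.)
    customer_id: str  # which end customer this run serves
    outcome: str      # "success" or "failure"

def emit(event: IntegrationEvent) -> str:
    """Serialize the event as one JSON line for the shared log stream."""
    return json.dumps(asdict(event))

event = IntegrationEvent(
    trigger="webhook",
    occurred_at=datetime.now(timezone.utc).isoformat(),
    business_ref="PO-10042",
    component="order-transform",
    customer_id="acme-corp",
    outcome="success",
)
```

Because every integration emits the same fields, one query like "all failures for customer acme-corp today" works across the whole portfolio.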
If every integration logs “what happened” in its own unique way, solving problems gets hard. That trickles down to poor answers to customer questions. It also makes it impossible to track metrics about your integrations as a group.
Put simply, it makes supporting the integration harder. Integration is already hard enough!
Use open standards.
But, here’s the good news! A lot of the problems you will be trying to solve have already been solved, at least in adjacent forms.
You want to standardize, but don’t invent your own standard unless you absolutely must. Wherever possible use an open or publicly available standard!
This saves you the time and difficulty of inventing a standard. It also makes your integrations more portable. It makes them more familiar to new engineers you may hire into your team. It makes your integrations more compatible with other systems.
And, the more broadly adopted the standard, the better.
For example, Doohickey-managed integrations emit all history events using the OpenTelemetry standard. This means the data is compatible with many other observability products.
Use layers of information.
Most software has multiple layers of architecture (e.g. UI, API, database). Your integrations should use a similar approach.
You should also use this “layered” approach with how you structure support data. Make sure information flows to the appropriate layer(s) and consider who has access to each and why.
These layers probably include:
- The external event(s) that triggered an integration to execute
- The stream of events making up the execution of the integration
- The logs created during execution
- The logs created tangentially to the integration’s execution
- The runtime’s performance and compute metrics
All these are useful when it comes to figuring out “what happened”. How they are useful and to whom they are relevant will vary. But, consider how they all are potentially relevant. Build processes around using them.
Build a process and an integration.
Speaking of processes, you should be building a process and an integration.
So far we’ve talked about best practices that are tangible and technical, like “build it this way”. All this is to create integrations that are built and supported consistently.
You must design both the integration and what the people around it will have to do, when, and why. People make up most of the expense and complexity of supporting an integration.
Process helps to reduce human-caused variability. Build your integrations to support that process. Build them alongside the process!
Consider the impact of failure.
If you sat down for long enough, you could fill pages with ways a given integration could fail. Integrations are complex, and there are many things that can go wrong.
But, not all failures are equal!
Some failures have a massive economic or operational impact. Some are noteworthy events that have very little impact. Most are somewhere between.
Consider the impact of different failures. Gauge how much time and money to spend building systems around those failures. Think about these problems in economic terms.
This post isn’t about implementing perfection. Please be thoughtful about where you apply this advice.
Move troubleshooting as close as possible to end customers.
Don’t be afraid to include customers in troubleshooting too! They use the integrations and definitely have a stake in addressing failures fast.
You should generally move troubleshooting as close as possible to the end customer. Anything support-wise you can empower customers to do themselves, you should consider enabling.
The first escalation level from there should be customer-facing support, then engineers with customer experience. Most teams start with the latter. Don't do that!
Show end customers enough, but not too much.
That said, you only want to show customers enough to support the above strategy.
There might be an inclination to show customers everything. After all, you just read “move troubleshooting close to the customer!" But, too much information can overwhelm or confuse customers. This is especially true if the info is not relevant to how they can solve their problem.
As an example, showing technical server logs to non-technical customers will not help. They probably can't use them to solve the problem that has arisen. Provide them information, summarized into terms they can understand. This isn’t a matter of transparency, it’s a matter of effective communication.
(To be clear, we don’t advocate that you lie to or hide important information from your customers. Don’t do that, please!)
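One simple way to strike that balance is to map raw errors to plain-language summaries before they reach the customer. The error codes and wording below are hypothetical:

```python
# Hypothetical mapping from raw integration errors to customer-facing text.
FRIENDLY_MESSAGES = {
    "HTTP_401": "We couldn't connect to your accounting system. "
                "Please reconnect it in Settings.",
    "HTTP_429": "Your accounting system is temporarily busy. "
                "We'll retry automatically.",
    "SCHEMA_MISMATCH": "An order is missing a required field. "
                       "Please check the order and resubmit.",
}

def customer_message(error_code: str) -> str:
    """Summarize a technical failure in terms the customer can act on,
    without exposing raw server logs."""
    return FRIENDLY_MESSAGES.get(
        error_code,
        "Something went wrong with your sync. Our team has been notified.",
    )
```

The full technical detail still goes to your logs and engineers; the customer just sees the summary, plus a clear next step where one exists.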
It’s easy to forget about what happens once the integration goes live. So much effort goes into what it’ll take to build and launch the integration, but it doesn’t end there.
We often say building an integration is like adopting a dog. It doesn’t end after that first fun (if you like dogs) weekend. It’s a long term commitment, and while it gets easier, you need to have a plan.
Building supportable integrations over the long term starts up front. Ask and answer the questions we discussed in this post, and you're well on your way!