Building a distributed system that is reliable and performant requires good planning and design.
In this article, we’ll navigate through a list of topics we need to keep in mind when designing a distributed system.
These topics should help us understand the different challenges that we might face.
Below is the list of the 8 Fallacies of Distributed Systems that we will navigate through:
- The network is reliable
- Latency isn’t a problem
- Network bandwidth is infinite
- The network is secure
- Topology does not change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
1. The network is reliable
API calls will sometimes fail. Network calls will sometimes time out. Network outages will happen.
These are a few scenarios that might (and probably will) happen in a typical network. Most applications deployed as distributed systems today make calls via APIs, whether to an external party or to some service running on the same network.
We need to keep in mind: what will happen if these calls fail? How will our system respond? Will there be any data loss?
Let’s consider the following example:
var paymentHandler = new PaymentHandler();
paymentHandler.Process(paymentDetails);
What happens if the network call fails? What if we got a timeout?
If we simply retry the request, how do we ensure that we do not double charge the client?
A possible solution is to mitigate the network's unreliability with a retry and acknowledgement mechanism backed by a queueing system (store and forward, from Enterprise Integration Patterns).
When using a queueing system, we can utilize a dead-letter queue to capture messages that fail to be processed and provide proper handling for them. This also gives us the ability to process the messages in the dead-letter queue in a separate process.
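As a minimal sketch of safe retrying, the loop below reuses a single idempotency key across attempts so the server can detect duplicate submissions and avoid double charging. Note that the Process overload accepting an idempotency key is an assumption for illustration, not part of the snippet above.
// Hypothetical sketch: retry the payment call with a fixed idempotency key,
// so the server can recognize and ignore duplicate submissions of the same payment.
var idempotencyKey = Guid.NewGuid();          // the same key is reused for every retry
var paymentHandler = new PaymentHandler();
for (var attempt = 1; attempt <= 3; attempt++)
{
    try
    {
        paymentHandler.Process(paymentDetails, idempotencyKey);
        break;                                // success, stop retrying
    }
    catch (TimeoutException) when (attempt < 3)
    {
        // Transient failure: back off exponentially and try again.
        Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
    }
}
The key point is that the receiving side must treat requests carrying the same key as one logical operation; without that, retries remain unsafe.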
2. Latency isn’t a problem
Latency is something that usually gets neglected when developing an application. Yet, in some use cases, latency can have a major effect on the performance of the system.
So what is latency?
Latency is the inherent delay present in any communication, regardless of the size of the transmission. The total time for a communication will be:
TotalTime = Latency + (Size / Bandwidth)
(See also: "Fallacy #2: Latency is zero", Particular Software.)
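For example, with an illustrative 40 ms round-trip latency between regions and a 10 MB payload on a 100 Mbit/s (12.5 MB/s) link:
TotalTime = 0.04 + (10 / 12.5) = 0.04 + 0.8 = 0.84 seconds
Even if the payload were tiny, the 40 ms latency floor would remain, and it is paid again on every additional round trip.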
How to tackle latency issues?
Minimize network round trips as much as possible
When a request is sent over the network multiple times, the latency penalty is paid each time. Pack the data you need into a DTO and send it once, but do not send oversized, bloated DTOs either: their large size will ultimately increase the total time it takes to deliver them.
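As a small, hypothetical sketch (the client methods and the summary DTO below are assumptions for illustration, not an existing API), one coarse-grained call replaces several chatty ones:
// Chatty: three round trips, three latency penalties.
var customer  = await client.GetCustomerAsync(customerId);
var orders    = await client.GetOrdersAsync(customerId);
var addresses = await client.GetAddressesAsync(customerId);
// Coarse-grained: one round trip returning a single DTO with just the fields we need.
var summary = await client.GetCustomerSummaryAsync(customerId);
// e.g. summary contains CustomerName, OpenOrders, DefaultAddress and nothing more.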
Utilize in-memory cache
Latency within the same data center might not be a big problem, but the picture changes for cloud applications with servers deployed in multiple geographical regions.
Utilizing an in-memory cache can save an expensive network call to the database, especially if the database server is located in a different geographical location.
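A minimal sketch of this idea using Microsoft.Extensions.Caching.Memory; the product repository and the 5-minute expiration are illustrative assumptions:
using Microsoft.Extensions.Caching.Memory;

var cache = new MemoryCache(new MemoryCacheOptions());

// Return the cached product if present; otherwise hit the remote database once
// and keep the result in memory for a few minutes.
var product = await cache.GetOrCreateAsync($"product:{productId}", entry =>
{
    entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(5);
    return productRepository.GetByIdAsync(productId);   // hypothetical repository call
});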
Consider the geography
If possible, place your servers as close together as possible. Consider using a CDN to deliver the requested resources.
Working asynchronously
Consider using an event-based design for the system. In such a design we do not need to wait for the result of the computation, which hides most of the network latency, although we still pay a small penalty when publishing the event itself.
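A hedged sketch of the idea; the event bus interface and the OrderPlaced event are assumptions for illustration:
// The event carries the data subscribers need (hypothetical shape).
public record OrderPlaced(Guid OrderId, decimal Total);

// Instead of calling the invoicing and shipping services and waiting for both,
// publish a single event and return; subscribers react on their own schedule.
await eventBus.PublishAsync(new OrderPlaced(order.Id, order.Total));
// We only pay the latency of publishing the event, not of every downstream call.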
3. Network bandwidth is infinite
The real problem with bandwidth is not one of absolute speed but one of scale. It’s not a problem if I want to download a big movie. The real problem is if everybody else wants to download big movies too.
When transferring lots of data in a given period of time, network congestion can easily occur and affect the absolute speed at which we’re able to download. Even within a LAN, network congestion and its effect on bandwidth can have a noticeable impact on the applications we develop.
Another example is transferring a large amount of data over the wire in response to many client requests, or simply a large dataset sent to the client machine as the result of some database query.
Suggested solutions
- Do not eagerly fetch all the data.
Consider only what we actually need, and whether it really needs to be eagerly fetched.
- Consider network separation.
Using separate networks for separate business domains can lower the load on the overall network, for example a dedicated network for time-critical applications and servers.
- Consider using the claim-check pattern (see the sketch below).
Store the entire message payload in an external service, such as a database. Get a reference to the stored payload, and send just that reference to the message bus. The reference acts like a claim check used to retrieve a piece of luggage, hence the name of the pattern. Clients interested in processing that specific message can use the obtained reference to retrieve the payload, if needed.
(For more information, refer to Claim-Check pattern – Cloud Design Patterns | Microsoft Docs)
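A hedged sketch of the claim-check flow; the blob store, queue client, and message shape below are illustrative assumptions rather than a specific library API:
// Producer side: store the heavy payload externally and send only a reference.
var payloadId = Guid.NewGuid().ToString();
await blobStore.UploadAsync(payloadId, largeReportBytes);          // external storage
await queueClient.SendAsync(new { Type = "ReportReady", PayloadId = payloadId });

// Consumer side: use the reference (the "claim check") to fetch the payload if needed.
var reportBytes = await blobStore.DownloadAsync(message.PayloadId);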
4. The network is secure
The only completely secure system is one that is not connected to any network, and probably locked behind a safe or some other form of physical protection.
We cannot assume that the network is 100% secure, so taking precautions and security measures is highly recommended and advised.
The OWASP Top Ten offers a list of the most common threats today and how to mitigate them. When we design distributed software (and, in my opinion, any software), we should always keep security in mind.
5. Topology does not change
Network topology will constantly change. Sometimes the change is planned, like adding new servers, decommissioning old ones, adding new networks, or adding new storage locations. Other times it changes due to unplanned events, like a server or hardware failure. Today, with containerization and orchestration offerings such as Docker, Azure Kubernetes Service, and Amazon Elastic Container Service (or the respective products of other vendors) available to (almost) everyone and every company, network topology changes are even more visible.
How can we mitigate against network and topology changes?
- Abstraction of network specifics
– We should not rely on hardcoded IP addresses or absolute network locations. Instead, we should consider using a DNS server or service to resolve hostnames to IP addresses (see the sketch further below).
– Consider using a service discovery pattern and service discovery tools.
- Design for failure and mitigate against it
– Avoid irreplaceable servers: we should assume that any server can and will fail, so we need to design the architecture so that no single server is irreplaceable.
– Server groups: if we treat our servers as groups, in which any individual server can fail, the group's overall functionality and capabilities should not be impacted.
– Consider introducing chaos engineering to test system behavior under infrastructure, network, and application failures. Tools like Netflix's Chaos Monkey take this approach and randomly shut down a server or a service in a group.
For more information on Chaos Engineering, refer to the following links:
https://netflixtechblog.com/tagged/chaos-engineering
https://searchitoperations.techtarget.com/definition/chaos-engineering
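As mentioned in the first bullet above, a minimal sketch of resolving a service by hostname at runtime instead of hardcoding an IP address; the hostname is illustrative:
using System.Net;

// Resolve the service by name each time (or on a refresh interval) so that
// topology changes behind the DNS record do not require a redeploy.
var addresses = await Dns.GetHostAddressesAsync("payments.internal.example.com");
var endpoint  = new IPEndPoint(addresses[0], 443);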
6. There is one administrator
No one person knows everything, or at least we should always keep that in mind. It is likely that your application will be deployed in an environment over which you do not have full control, or one that differs from your local development environment in any of many ways: operating system, storage, network policies, and more.
The network, or the environment the application is deployed in, will probably be managed by multiple teams or multiple people: security, storage, network, and so on.
The more people involved in the environment topology and management, the less control you have.
What are possible solutions for the above?
- DevOps and admin involvement
Involve DevOps teams and system administrators from the start. Keeping the DevOps and infrastructure teams updated on your progress can help you spot constraints and problems at an early stage.
- Logging and monitoring
Diagnosing issues in a distributed system is not trivial, especially in complex systems. To gain better visibility into the system's behavior, make sure centralized logging, metrics, and tracing are considered and implemented; many tools are available today, and some have matured into very powerful options in this area (see the sketch after this list).
– Rich logging and monitoring tools will enable you to extract data on events and failures more easily.
– Keep the observability of the application in mind, and how you can improve it.
For more information, refer to The Three Pillars of Observability.
- Decoupling
Ensuring appropriate decoupling between system components allows for greater resiliency in the event of either planned or unplanned downtime.
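A small sketch of the logging idea using Microsoft.Extensions.Logging; the correlation-id convention and the message wording are assumptions, not a prescribed standard:
using Microsoft.Extensions.Logging;
using System.Collections.Generic;

// Attach a correlation id to every log entry written while handling this request,
// so the same request can be traced across services in the centralized log store.
using (logger.BeginScope(new Dictionary<string, object> { ["CorrelationId"] = correlationId }))
{
    logger.LogInformation("Processing payment for order {OrderId}", order.Id);
    // ... call downstream services, each logging under the same correlation id
}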
7. Transport cost is zero
Whereas the second fallacy relates to the time it takes to transport data over the network, this fallacy relates to the resources required to do so. Transport cost is not zero. It might seem negligible, especially if the system is deployed on its own hardware, but that is not the case when the system is distributed over multiple geographical locations, and it becomes even more impactful when the system is deployed in the cloud.
Serialization/Deserialization and object mapping
Serialization and deserialization consume CPU time, so they cost money. If your app is deployed on premises, this cost is somewhat hidden unless you actively monitor resource consumption.
The cost of the networking infrastructure
The actual cost of the network infrastructure in use: servers, SANs, network switches, load balancers. All of this costs money.
While the cost of infrastructure is mostly outside your control, optimizing its use is within your domain.
Make sure you pick the right data format for the job at hand: for example, SOAP or XML is more expensive than JSON, and JSON is more expensive than binary protocols like Google's Protocol Buffers.
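As a rough, illustrative sketch using System.Text.Json (the order object is an assumption, and actual sizes depend entirely on your data), measuring payload size is one way to make the cost visible before choosing a format:
using System.Text.Json;

// Serialize the same object and inspect the payload size before choosing a format.
var jsonBytes = JsonSerializer.SerializeToUtf8Bytes(order);
Console.WriteLine($"JSON payload: {jsonBytes.Length} bytes");
// A binary format (e.g. Protocol Buffers) would typically produce a noticeably
// smaller payload and cheaper serialization, at the cost of human readability.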
8. The network is homogeneous
The network is not homogeneous. It contains different types of devices, from mobile phones, desktop computers, and IoT devices to servers and more, all of which need to communicate with each other in some way. They may not be interoperable, and a variety of standards-based and proprietary protocols and data formats may be in use.
Possible solution to this fallacy? Interoperability.
We should try to avoid proprietary protocols and formats, and prefer standard ones, like JSON, XML, or Protocol Buffers.