Centralized logging and tracing for your microservices
No additional client libraries needed in your service code
API proxy for logging and controlling traffic between microservices
Define both global traffic rules and custom rules for individual services
No changes required in how your services are currently deployed
Apex is an API proxy for microservices. It provides one place to log and control service-to-service traffic.
Apex is designed for small teams that have just begun migrating from a monolith to a microservices architecture. While microservices bring many benefits, such as faster deployment cycles, they also bring a host of new challenges because they rely on the network for service-to-service communication. Since network communication is unreliable and introduces latency, faults become more likely, and teams end up spending more time diagnosing network-related faults and writing pre-emptive fault-handling logic within each service. [1]
Several solutions already exist to help teams perform these tasks faster. Client libraries can be imported into each service’s code to automate networking concerns, an API gateway can be inserted in front of all services to handle incoming traffic, and for large systems, a service mesh is often deployed to abstract away networking concerns from services altogether. These are all valid solutions, each with its own set of trade-offs.
For a small team running their first few microservices, however, none of the existing solutions offers the right set of trade-offs: optimization for service-to-service traffic, and ease of deployment and operation, prioritized over high availability and scalability. These are the trade-offs that underpinned Apex’s design.
With Apex, a user can view the logs for all service-to-service traffic by querying just one table, while grouping all requests and responses that belong to the same workflow. They can also define and update traffic rules, such as the number of times to retry a request, in one configuration store.
To understand how Apex makes it easier to work with microservices, it is important to first understand what microservices are. This, in turn, requires understanding that the microservices architecture is a choice, the other choice being, of course, a monolith.
In a monolithic architecture, there is typically just one application server (the ‘monolith’) that holds all the business logic. In some cases, this application server alone is already sufficient to serve an application to a user (e.g. a website with just static HTML). More likely though, the application will also generate some user data that must be persisted, and so the monolithic application server will also transfer data to and from a database server.
Consider the above example of a monolithic system that serves an e-commerce store to users. The business logic in the app server can be organized into classes or modules, or more generally, ‘subsystems’, that encapsulate related functionality e.g. manipulating customer data, checking and updating inventory, creating shipments. These subsystems can each expose an interface of methods, or more generally ‘behaviors’, that can be invoked by each other to facilitate communication between them.
As method or function calls take place within the same running process in memory, they are reliable and very fast, with latency often measured in microseconds [2].
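To make this concrete, here is a minimal sketch (in JavaScript, with illustrative class and method names of our own) of the orders subsystem invoking the shipping subsystem in-process:

```js
// A minimal sketch (class and method names are illustrative) of two subsystems
// in a monolith communicating through an ordinary in-process method call.
class ShippingSubsystem {
  createShipment(orderId, address) {
    // ...persist the shipment and return its details
    return { shipmentId: 1, orderId, address };
  }
}

class OrdersSubsystem {
  constructor(shipping) {
    this.shipping = shipping;
  }

  placeOrder(order) {
    // A plain method call: same process, same memory, microsecond-level latency
    return this.shipping.createShipment(order.id, order.address);
  }
}

const orders = new OrdersSubsystem(new ShippingSubsystem());
console.log(orders.placeOrder({ id: 7, address: '221B Baker St' }));
```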
Another possible monolithic architecture is to further decouple the data store for each subsystem, by separating it into multiple database servers. For example, the customers subsystem and the orders subsystem can be connected to separate database servers, if that is deemed to be e.g. more flexible or scalable for a particular need.
A simple analogy for a monolithic application is a small business run by just one owner. The owner has to do everything - sales and marketing, procurement, operations, finance, IT. There may be one central log book that keeps track of all business data, or the owner could use several ‘persistent data stores’ in parallel e.g. CRM system for sales data, accounting software for financial data, ERP system for inventory data, pen and paper for tax filings.
The microservices architecture differs from the monolith in two major ways. First, subsystems are decoupled even further. Each subsystem is deployed independently to its own app server as a standalone ‘application’, or ‘service’, and the current best practice is for every service to have its own database [3].
Secondly, subsystems now communicate over the network via HTTP requests, rather than through in-process method invocations. So for example, if our orders service needs to create a new shipment, it might do this by sending a POST request to the /shipments endpoint of the shipping service, and attaching any other relevant information in the request body.
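For comparison, the same interaction as a network call might be sketched as follows (assuming Node 18+ for the built-in fetch; the host name and payload shape are hypothetical):

```js
// The orders service no longer holds a reference to shipping; it only knows
// shipping's HTTP API. Host name and payload shape are hypothetical.
async function createShipment(order) {
  const response = await fetch('http://shipping.example.internal/shipments', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ orderId: order.id, address: order.address }),
  });

  if (!response.ok) {
    // Network boundaries introduce failure modes that did not exist in-process
    throw new Error(`shipping responded with ${response.status}`);
  }
  return response.json();
}
```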
Going with the same analogy of a small business, the microservices architecture is comparable to a small team of several members (or ‘services’) who each specialize in one function. For example, these could include a salesperson, a marketer, an operations manager, accountant/bookkeeper and an IT manager. Now, function-to-function communication no longer happens in the owner’s head (or ‘in-process’); instead, different team members must communicate with each other in person, on the phone or by email (or ‘over the network’) to get things done.
As we shall see, the use of the network for communication between subsystems is the key enabler for many of the benefits of the microservices architecture, but also the main culprit behind many of its drawbacks.
A first benefit of microservices is a wider choice of technologies for service developers. [4] The network boundaries between services free them from having to use the same technology stack. As long as each service maintains a stable network interface, or API, for other services talking to it, it is free to choose the language or framework that is most optimal for implementing its business logic.
Arguably the most defining benefit, though, is the option to deploy subsystems independently of each other. [5] With subsystems now deployed to independent services that each have a smaller scope, redeploying any one subsystem incurs less overhead and so it becomes practical to redeploy each service more frequently. This enables teams to ship new features faster and reap the corresponding business benefits sooner.
More concretely, in our e-commerce example app, as soon as a feature in the orders service is ready, orders can be redeployed. As long as orders’s API remains the same before and after the deployment, other services need not even know that a redeployment took place. On the other hand, if notifications’s logic rarely changes, then that service can simply continue to operate untouched.
Independent redeployment also enables independent scaling. [6] If our orders service is the first to reach its capacity, then we can simply upgrade orders to a more powerful server or deploy more replicas of orders, without having to also replicate every other service. Yet again, as long as the replicated orders service retains the same API before and after scaling, the other services can continue to operate as though nothing happened. The result is fewer large-scale, system-wide redeployments and higher utilization of provisioned resources, leading to savings in engineering time and costs.
We have now seen how the network boundaries between microservices result in several major benefits over the monolithic architecture. The network, however, comes with baggage, and relying heavily on it to communicate between subsystems introduces an entire new dimension of challenges.
Recall that in a monolith, subsystems are simply classes or modules that communicate through method invocations within the same process in memory. In contrast, in a microservices architecture, equivalent calls are now sent between services using HTTP requests and responses over the network.
As any sufficiently heavy user of the internet will have experienced, the network is unreliable and has latency. That is, networks can disconnect for any number of reasons, and network traffic can sometimes take a long time to reach its destination. Even though in production, services are likely deployed to state-of-the-art data centers run by large cloud providers, network faults still can and do occur.
Such faults introduce a whole new class of problems for developers - not only do they have to ensure their service code is bug-free, now they also have to diagnose unexpected network faults, and add logic to service code that preempts network faults by providing compensating behaviors (e.g. displaying a ‘network down’ page to users, or retrying the same request a few seconds later).
Diagnosing a network fault can be especially cumbersome when a single workflow passes through multiple services. Consider a user placing an order on our e-commerce example app, and suppose the orders service needs to first update inventory in inventory, then create a shipment in shipping. This one workflow involves 3 services with at least 3 network hops between them. If the order placement eventually fails, what caused that to happen?
To find out, a developer would have to trace the user’s initial POST request through the entire system. Since each service generates its own logs, the developer would have to first access orders’s logs, track down the request that failed, follow the request to the next service (in our case, the inventory service), access inventory’s logs, and so on, until they pinpoint the exact request that failed. This can be a laborious and slow process.
Other times, a network fault may be totally random, and a request should simply be retried. But how long should the requesting service wait before retrying? How many times should it retry before giving up? If retries come too soon or too often, they could overwhelm the responding service. Such logic must be defined thoughtfully.
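For illustration, a bare-bones retry helper might look like the sketch below; the structure is our own, and the default values simply mirror the global defaults discussed later in Section 5:

```js
// Retry a request up to maxRetries times, waiting backoffMs between attempts.
// Choosing these numbers well - per pair of services - is the hard part.
async function requestWithRetry(url, options, maxRetries = 4, backoffMs = 5000) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url, options);
      if (response.ok) return response;
    } catch (err) {
      // swallow network errors and fall through to the next attempt
    }
    if (attempt < maxRetries) {
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
    }
  }
  throw new Error(`request to ${url} failed after ${maxRetries + 1} attempts`);
}
```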
The next question becomes: where should all this logic be defined? For some teams, the first answer to this question is in HTTP client libraries that are imported into each service’s code. [7] So if the orders service is written in Ruby, then it would require a gem that provides a configurable client for making HTTP requests to other services. Another service written in Node might import a similar package into its code.
Often, these libraries can also handle logging, as well as other networking and infrastructure concerns, such as caching, rate-limiting, authentication etc.
Teams with more resources may go further, by having each service’s owner write a client library for every other service that calls it. This is already common practice when working with popular external APIs; for example, Stripe provides dozens of official and third-party libraries in different languages that abstract away the logic for calling its APIs. [8] Similarly, in a large team, each service’s owner may be tasked with writing a new client library for every requesting service that uses a different language.
Needless to say, this solution becomes less and less manageable as the number of services grows. Every time a new service is built in a new language, every other service owner must write a new client library in that language. More critically, updating fault-handling logic now incurs a great deal of repetitive work. Suppose the CTO wishes to update the global defaults for the retry logic; developers would now have to update the code in multiple client libraries in every service, then carefully coordinate each service’s redeployment. The greater the number of services, the slower this process becomes. [9]
With microservices becoming increasingly popular, a number of solutions have emerged to help teams overcome these challenges. Here we explain how two of these solutions - the API gateway and the service mesh - compare with each other.
Both of these solutions in fact share the same building block - a proxy server.
A proxy is simply a server that sits on the path of network traffic between two communicating machines, and intercepts all their requests and responses. These machines could represent a client sending a request to another server, or for our purposes, two internal services communicating within the same architecture.
In the above diagram, orders does not send an HTTP request directly to shipping; instead, it addresses its request to a host belonging to proxy (i.e. proxy.com). In order for proxy to know that orders actually wants to send its request to shipping, orders must specify shipping’s host (i.e. shipping.com) in another part of the request, e.g. in the Host header value. When proxy receives a response back from shipping, it simply forwards the same response back to orders.
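The forwarding behaviour described above can be sketched with nothing more than Node’s built-in http module (host names and port are hypothetical, and a real proxy would of course do much more):

```js
const http = require('http');

const proxy = http.createServer((clientReq, clientRes) => {
  // The requesting service put the real destination (e.g. shipping.com)
  // in the Host header; use it to decide where to forward the request.
  const [hostname, port] = (clientReq.headers.host || '').split(':');

  const upstreamReq = http.request(
    {
      hostname,
      port: port || 80,
      path: clientReq.url,
      method: clientReq.method,
      headers: clientReq.headers,
    },
    (upstreamRes) => {
      // Relay the response back to the requesting service unchanged
      clientRes.writeHead(upstreamRes.statusCode, upstreamRes.headers);
      upstreamRes.pipe(clientRes);
    }
  );

  clientReq.pipe(upstreamReq);
});

proxy.listen(8080);
```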
At its core, an API gateway is simply a proxy server (more precisely, a ‘reverse proxy’ [10]). When used with microservices, one of its primary functions is to provide a stable API to clients and route client requests to the appropriate service. [11]
It is certainly possible to deploy microservices without an API gateway. In such an architecture, whenever the client sends a request, it must already know which service to send the request to, and also the host and port of that service. This tightly couples the client with internal services, such that any newly added services, or updates to existing service APIs, must be deployed at the same time as updates in the client code. Such an architecture can be difficult to manage, as clients cannot always be relied upon to update immediately (e.g. mobile apps cannot be easily forced to update); even if they can, doing so would still incur additional engineering that could be avoided.
With an API gateway, developers are largely free to update internal services while still providing a stable API to clients.
In addition to routing requests, the API gateway also provides one place to handle many concerns that are shared between services, such as authentication, caching, rate-limiting, load-balancing and monitoring.
In a way, an API gateway can be thought of as a receptionist at a large company. Any visitor does not necessarily have to know which employees are present in advance, or how different teams work together to complete specific tasks. Instead, they simply speak with the receptionist, who then decides, based on the visitor’s identity and stated purpose, which company employee to notify, and/or what access to grant to the visitor.
Let us revisit the challenges that were described back in Section 3: 1) diagnosing faults in workflows that span multiple microservices, and 2) managing fault-handling logic that is similar across services.
If the API gateway already provides one place to manage networking concerns, perhaps it is already a sufficient solution to these challenges? For example, instead of deploying it as a ‘front proxy’ that sits in front of all services, we could deploy it in a different pattern than it was intended for - as a proxy that sits between services internally. Would this not already provide the one place to log all service-to-service requests and responses, and define fault-handling logic like retries?
In theory, this is certainly possible, but in practice, existing API gateway solutions are not ideal options for this.
Optimized to handle [client-server] traffic at the edge of the data center, the API gateway ... is inefficient for the large volume of [service-to-service] traffic in distributed microservices environments: it has a large footprint (to maximize the solution’s appeal to customers with a wide range of use cases), can’t be containerized, and the constant communication with the database and a configuration server adds latency.
- NGINX, maker of the popular open-source NGINX load balancer and web server [12]
In short, although the API gateway looks close to the solution we need, existing solutions on the market come built-in with many extra features that are designed for client-server traffic, making them a poor fit for managing service-to-service traffic.
That is not to say a solution like an API gateway is completely out of the question. As we shall see in Section 5, the API gateway pattern was a major source of inspiration for Apex’s solution.
The service mesh is another existing solution to the challenges with microservices that were outlined in Section 3. As mentioned previously, it also builds upon the proxy server.
The service mesh is a highly complex solution, and we once again approach it through the analogy of a company. Consider a large team of people (analogous to services) who all communicate directly with each other.
As the team size grows, team members will likely find themselves spending more and more time on communication chores: chasing colleagues who have not replied, working out who should receive each message, verifying who a message actually came from, and fielding more requests than they can handle at once.
Managing these communication-related issues would take away time and focus from each team member’s core responsibilities.
In this example, adding a service mesh is analogous to giving every team member a personal assistant (PA), who intercepts all incoming and outgoing messages and handles all the above tasks. This team structure would free team members from having to handle communication-related tasks, and allow them to focus more on their core responsibilities.
In an actual service mesh, the PA would instead be a proxy server, known as a ‘sidecar proxy’. Each service is deployed alongside its own sidecar proxy, which intercepts all requests and responses to and from its parent service, and handles all the networking and infrastructure concerns we listed above, such as retry logic, rate-limiting etc. As a result, each service’s code can focus on its main business logic, while outsourcing networking and infrastructure concerns to the service’s sidecar proxy. [13]
In addition to the sidecar proxies, the service mesh has one other important component - a central configuration server.
Back in our hypothetical company, a configuration server is akin to a centralized folder containing data on team members and company policies e.g. who is on leave, who is working reduced hours, which secure channels to use, who has access to what information. Each personal assistant (PA) would have their own copy of this information to help them handle communication quickly, but whenever anything is updated in the centralized folder e.g. by the COO or HR Director, the changes are immediately sent to each PA, so that PAs always have the most up-to-date information in their own copies.
In the same way, the configuration server in a service mesh provides one place to update network traffic rules, such as logic for retries, caching, encryption, rate-limiting, routing. The configuration server is the source of truth for this information, but each sidecar proxy also has a cached copy of the information. Whenever the configuration server gets updated, it propagates the changes to each sidecar proxy, which then applies the changes to its own cached copy. [14]
Again, let us revisit the challenges that were described back in Section 3: 1) diagnosing faults in workflows that span multiple microservices, and 2) managing fault-handling logic that is similar across services.
The service mesh provides a robust solution to these challenges. The configuration server provides one place to define and update fault-handling logic; each sidecar proxy can be responsible for generating logs and sending them to one place to be stored, and also for executing fault-handling logic. Moreover, without any single point of failure or one single bottleneck, the architecture is resilient and highly scalable. [15]
However, as with so many tools, rich functionality begets complexity. Implementing a full service mesh more than doubles the number of components in the architecture that must now be deployed and operated. In addition, both the sidecar proxy and its parent service are usually containerized to run alongside each other in the same virtual server. [16] If any existing service is currently deployed without a container, then developers must now containerize it and redeploy it. More domain expertise must be acquired, and significant engineering effort expended.
As we have seen, solutions certainly exist to handle the challenges we described with microservices. Each existing solution embodies a different set of trade-offs.
For some teams, neither an API gateway nor a service mesh provide the right set of trade-offs. Consider a small team that are just beginning to migrate their monolith to include a few microservices. For ease of deployment, most of the services have been deployed to Heroku, or another platform as a service (PaaS) solution.
It is likely that this team will have already experienced the challenges we mentioned back in Section 3: 1) diagnosing faults in workflows that span multiple microservices, and 2) managing fault-handling logic that is similar across services.
For this team, a solution with a different set of trade-offs is needed: one optimized for service-to-service traffic and for ease of deployment and operation, even at the cost of high availability and scalability.
These are precisely the trade-offs we chose when building Apex.
Apex’s architecture includes 5 components: apex-proxy, apex-logs-db, apex-config-store, apex-admin-api and apex-admin-ui.
Apex’s core component, the apex-proxy server, sits on the path of network traffic between every pair of communicating microservices, such as that between orders and shipping above. In the case of systems with more than two services, the following diagram shows how Apex would be deployed.
Recall that an API gateway is just a proxy server that handles all client-server traffic coming into a system, and routes client requests to the correct service. In a similar way, Apex can be thought of as a stripped-down, internally deployed API gateway, which routes not traffic between clients and servers, but traffic between services.
Zooming further into apex-proxy, there are several middleware layers that each provide additional functionality beyond simple proxying, such as authentication, routing, retries and logging.
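Conceptually, these layers can be pictured as a chain of Express middleware, something like the sketch below (the middleware bodies are stand-in stubs, not Apex’s actual implementation):

```js
const express = require('express');
const app = express();

// Each concern gets its own layer; the stubs only illustrate the ordering.
const authenticate = (req, res, next) => { /* verify the requesting service's token */ next(); };
const assignCorrelationId = (req, res, next) => { /* tag the request for tracing */ next(); };
const logRequest = (req, res, next) => { /* enqueue a log entry */ next(); };
const resolveRoute = (req, res, next) => { /* look up the responding service's host */ next(); };
const forwardRequest = (req, res) => { /* proxy the request, applying retry rules */ res.sendStatus(502); };

app.use(authenticate, assignCorrelationId, logRequest, resolveRoute);
app.all('*', forwardRequest);

app.listen(3000);
```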
Since apex-proxy intercepts all network traffic between microservices, it is able to aggregate logs for every request and response, and send them to apex-logs-db to be persisted and queried in one place.
Additionally, apex-proxy provides the ability to trace requests and responses that belong to the same request-response cycle. Any request that comes into apex-proxy is given an extra correlation-id HTTP header value (f84nw2 in the example diagram below), if it doesn’t have one already, before being logged. This same correlation-id value is then also included as the request is forwarded to the responding service. When a response comes back from the responding service, Apex adds this same correlation-id value to the response, before forwarding this updated response back to the requesting service. As a result, all requests and responses belonging to the same request-response cycle have the same correlation-id value when they are logged, making it easy to query them together.
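As Express middleware, this header handling could look roughly like the following sketch (the header name comes from the description above; the generation logic is our own illustration):

```js
const crypto = require('crypto');

// If the incoming request has no correlation-id, generate one; attach the same
// value to the forwarded request and to the response on its way back.
function correlationId(req, res, next) {
  const id = req.headers['correlation-id'] || crypto.randomBytes(3).toString('hex');
  req.headers['correlation-id'] = id; // travels onward to the responding service
  res.set('correlation-id', id);      // returned to the requesting service
  next();
}

module.exports = correlationId;
```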
This same feature also makes it possible to connect requests and responses belonging to workflows that span multiple services. As long as each service adds some logic to propagate any correlation-id header value that already exists in incoming requests, then all requests and responses belonging to the same workflow will have the same correlation-id value in apex-logs-db.
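Inside a service, ‘propagating’ the value simply means copying the header from the incoming request onto any outgoing requests made while handling it. A sketch of what that might look like in an Express handler (the proxy address and payload are hypothetical):

```js
const express = require('express');
const app = express();
app.use(express.json());

app.post('/orders', async (req, res) => {
  // Copy the correlation-id (if present) onto the downstream request, so every
  // hop of this workflow shares one value in apex-logs-db.
  const headers = { 'Content-Type': 'application/json' };
  const correlationId = req.get('correlation-id');
  if (correlationId) headers['correlation-id'] = correlationId;

  await fetch('http://apex-proxy:8080/shipments', {
    method: 'POST',
    headers,
    body: JSON.stringify({ orderId: req.body.id }),
  });

  res.sendStatus(201);
});

app.listen(4000);
```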
Now, figuring out where a request failed within a workflow is just a matter of querying apex-logs-db for that one correlation-id value. This solves our first problem of diagnosing faults in workflows that span multiple microservices.
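Because TimescaleDB speaks standard SQL over PostgreSQL, tracing a workflow could look like the sketch below (the table and column names are assumptions for illustration, not Apex’s actual schema):

```js
const { Client } = require('pg');

// Pull every logged request and response for one workflow, in time order.
async function traceWorkflow(correlationId) {
  const client = new Client({ connectionString: process.env.APEX_LOGS_DB_URL });
  await client.connect();

  const { rows } = await client.query(
    `SELECT time, requester, responder, method, path, status
       FROM logs
      WHERE correlation_id = $1
      ORDER BY time`,
    [correlationId]
  );

  await client.end();
  return rows;
}

traceWorkflow('f84nw2').then(console.log);
```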
Below, we demonstrate this feature on an actual deployed instance of Apex.
Similar to the service mesh, Apex also has a single configuration server, the apex-config-store, where developers can define logic for retries, routing etc. In this way, Apex can be thought of as a stripped-down service mesh.
apex-config-store contains the following configuration data:
service-credentials is used for authentication. It stores the list of service names along with their passwords. Every service that sends a request to Apex must authenticate itself with a token generated using its name and password.
service-hosts is used for routing. It lists the IP address or domain name where each service can be found.
default-default is used for defining global defaults for service-to-service traffic. In the above diagram, by default a request currently times out if a response is not received within 3,500 ms, and can be retried a maximum of 4 times. Each new retry attempt must wait, or ‘back off’, for 5,000 ms after the last failed request.
orders-shipping and shipping-inventory are examples of service-specific rules that override the global defaults. When orders sends a request to shipping, requests time out after 5,000 ms and can be retried no more than 2 times, with a backoff of 2,500 ms. When shipping sends a request to inventory, however, requests time out after 2,000 ms, and there are no retries at all.
For every request sent to Apex, apex-config-store is queried for authentication, routing and retry logic - in that order. Only after all three are complete do requests get forwarded on to the responding service.
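A sketch of these three lookups against Redis, using the node-redis client (the key names come from this section; the hash structure and fallback logic are assumptions on our part):

```js
const { createClient } = require('redis');

async function lookupConfig(requester, responder) {
  const redis = createClient({ url: process.env.APEX_CONFIG_STORE_URL });
  await redis.connect();

  // 1. Authentication: fetch the requesting service's stored password
  const password = await redis.hGet('service-credentials', requester);

  // 2. Routing: find where the responding service lives
  const host = await redis.hGet('service-hosts', responder);

  // 3. Retry logic: prefer the service-specific rule, fall back to the global default
  let rules = await redis.hGetAll(`${requester}-${responder}`); // e.g. 'orders-shipping'
  if (Object.keys(rules).length === 0) {
    rules = await redis.hGetAll('default-default');
  }

  await redis.quit();
  return { password, host, rules };
}
```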
With this one place to define and update configuration data, Apex’s architecture provides a solution to our second problem of managing fault-handling logic (as well as other network concerns) that is often similar across services.
Below is the actual Apex UI for defining retry logic for when the orders service calls the shipping service.
Applying the same company analogy that we used for the service mesh, Apex is comparable to having just one team assistant (apex-proxy), rather than one personal assistant per team member, mediate all communication for the whole team. Every time any team member needs to communicate with another team member, they send their messages through the team assistant. On every incoming message, the team assistant checks a centralized folder (apex-config-store) containing all the relevant information on team members and company policies, to verify the identity of the sender, determine who should receive the message, and decide how many times to retry should the first attempt fail.
Though Apex provides a solution to the two microservices challenges that were described back in Section 3, it comes with trade-offs, namely lower availability, and lower scalability.
One of the strengths of the full service mesh is that there is no one component that sits on the path of all service-to-service traffic. If a sidecar proxy crashes or gets overloaded, only its parent service becomes unavailable, while the remaining services can continue to operate normally. With Apex, however, the apex-proxy becomes a single point of failure and traffic bottleneck. Any outage in apex-proxy will halt all service-to-service traffic and render the entire system unavailable.
Ultimately, there is an inherent trade-off between the number of proxies in the system (and hence availability and scalability), and how easy it is to deploy and operate the system. Apex and service meshes occupy opposite ends of this spectrum.
Despite the seemingly divergent set of trade-offs between Apex and the service mesh, Apex’s architecture is in fact acknowledged by several service mesh vendors as a possible transitional architecture on the journey toward a full service mesh. NGINX calls this architecture a ‘Router Mesh’ [17]; Citrix calls it a ‘Service Mesh Lite’ [18], and Kong calls it an ‘Internal API gateway’ [19].
Therefore, any team that adopts Apex’s architecture can rest assured that they are not taking a path that is mutually exclusive to eventually adopting a full service mesh. The truth is quite the opposite - this architecture is “relatively easy to implement, powerful, efficient, and fast”, and forms part of a “progression” toward a service mesh [20].
When implementing Apex, we made technology choices based on the trade-offs we described in Section 5.1. In particular, we prioritized ease of deployment and operation over feature-richness, high availability and high scalability. The technologies we ended up choosing include Node.js and Express.js, TimescaleDB, Redis, React and Docker.
Below, we briefly elaborate upon each of these choices.
apex-proxy - Node.js and Express.js
For the main proxy server, we had the choice between using any popular web development framework (e.g. Rails) and building atop an existing proxy (e.g. Envoy, NGINX). Since one of our design goals was to be ‘simple to deploy and operate’, we preferred a solution that did not come built-in with any extra features that are irrelevant to our target user. With this in mind, we decided on the “fast, unopinionated, minimalist” Express.js framework [21], built on Node.js, a runtime known for its ability to “handle a huge number of simultaneous connections with high throughput” [22] and for its widespread usage among developers.
apex-logs-db - TimescaleDB
The request and response logs generated by apex-proxy are a type of time-series data. [23] To store them in one place, we chose TimescaleDB, a time-series database that can ingest data at a rate of more than 100,000 rows per second, even as a database reaches billions of rows [24]. This high ingestion rate mitigates the risk that writing logs to storage will become a bottleneck in the system.
apex-config-store - Redis
One of Apex’s core features is providing service owners with one place to modify service information (e.g. register their service, generate new credentials for authentication) and update fault-handling logic (e.g. retry logic) for their own service. To enable this, we had several options for where to store the configuration data: 1) in an environment file that is loaded into memory when apex-proxy spins up, 2) in a file that is read by apex-proxy for every request, or 3) in an external configuration data store.
Option 1 of using an environment file was immediately ruled out, as it requires that the apex-proxy process be restarted every time the file is updated. Between the two remaining options, storing configuration in a file on disk leads to faster reads, since in general disk IO is faster than fetching data over the network. However, files can be easily corrupted if, say, multiple processes write to the same file at the same time.
In the end, we decided on Option 3, and implemented a Redis key-value store that gets queried for configuration data on every request. Redis stores all its data in memory, and so enables reads at over 72,000 requests per second [25]. This somewhat makes up for Option 3’s slower read speed compared to Option 2. In addition, Redis persists data to disk once every second, ensuring that configuration data will remain intact even if the Redis instance crashes and must restart.
apex-admin-api - Node.js and Express.js
For convenience, we built a REST API that enables users to programmatically query their logs in TimescaleDB and update config data in Redis (as opposed to having to SSH into those instances and issue commands in the terminal). This API also provides the option for admins to build additional UIs for different access roles, e.g. a logs-only UI for users who are not authorized to update configuration data.
apex-admin-ui - React
Finally, apex-admin-ui communicates with the apex-admin-api backend and provides service owners with a convenient way to register new services, edit existing service information, add and edit custom configuration, and query logs by correlation_id.
Installing and running five interconnected components will likely be a time-consuming process fraught with unpredictable environment-specific errors. Standing by our design goal of being ‘simple to deploy and operate’, Apex’s components are all containerized using Docker, and deployed in a coordinated fashion with Docker Compose. This ensures Apex’s components are all deployed in the same (containerized) environment for every user.
As shown below, deploying Apex with Docker Compose locally requires just one docker-compose up command.
Apex’s documentation also provides step-by-step instructions for deploying Apex to AWS’s Elastic Container Service (ECS).
Sending large request and response bodies from apex-proxy to apex-logs-db can add significant latency to request-response cycles, spin up long-running processes in apex-proxy that decrease its throughput, and fill up apex-logs-db far faster than necessary.
To solve these problems, we ultimately chose to avoid decompressing any log bodies that arrive in a compressed format, and send logs asynchronously from an in-memory queue.
Quite simply, sending compressed bodies means fewer bytes transmitted and stored. For a typical web page that arrives at apex-proxy in a compressed format (e.g. CNN’s homepage), we found that sending the compressed body to apex-logs-db typically took less than 1 second, compared with 5-10 seconds for the decompressed version.
While keeping bodies compressed was a sensible choice, it came with the trade-off of inconvenience for users of apex-logs-db, who must now take the extra step to decompress bodies to make them human-readable again.
Queuing logs to be sent asynchronously has the effect of decoupling writes to apex-logs-db from request-response cycles through apex-proxy. If apex-proxy happens to receive a particularly large response body that must be logged, it can simply enqueue this log, and move on to forwarding the response back to the requesting service and then on to processing the next request. The request-response cycle can complete regardless of when, or whether, the log eventually gets sent to apex-logs-db.
Adding a queue in this way also lays the foundation for a further optimization - sending logs to TimescaleDB in batches. TimescaleDB’s own docs explain that this could further increase its data ingestion rate. [26]
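Put together, the queueing idea can be sketched as follows (the batch size, flush interval and table schema are assumptions, not Apex’s actual values):

```js
const { Client } = require('pg');

const logQueue = [];
const db = new Client({ connectionString: process.env.APEX_LOGS_DB_URL });

// Called from the proxy's request/response handlers; returns immediately,
// so the request-response cycle never waits on the database.
function enqueueLog(entry) {
  logQueue.push(entry);
}

// A separate loop drains the queue in small batches.
async function drainQueue() {
  const batch = logQueue.splice(0, 100);
  for (const entry of batch) {
    await db.query(
      'INSERT INTO logs (correlation_id, method, path, status, body) VALUES ($1, $2, $3, $4, $5)',
      [entry.correlationId, entry.method, entry.path, entry.status, entry.body]
    );
  }
}

db.connect().then(() => setInterval(drainQueue, 1000));

module.exports = { enqueueLog };
```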
However, this solution comes with two trade-offs. The first is that several large log queues within concurrent processes could consume a lot of memory, straining the host server. Given TimescaleDB’s high ingestion rate, we made the decision to accept this trade-off, in the belief that the logs will dequeue fast enough to avoid hitting such a limit.
The second trade-off is that should apex-proxy crash, any logs that have not yet been dequeued would now be lost from memory. Since each individual log is relatively unimportant data, we also deemed this an acceptable trade-off.
While containerizing TimescaleDB and Redis made deployment simpler for users, it also increased the risk of losing logs and configuration data. This is due to the ephemeral nature of Docker containers. [27]
Fortunately, Docker containers support ‘volumes’, a mechanism to persist data to a container’s host filesystem beyond the container’s lifespan. [28] When deploying containers locally with Docker Compose, enabling this feature requires just an extra line in the docker-compose.yml configuration file, as shown below.
Deploying Docker containers to AWS’s Elastic Container Service (ECS), though, requires more care. ECS offers two launch types [29]: the EC2 launch type provides more control, by allowing developers to choose the type and quantity of EC2 instances to provision for their containers. Its downside is that it requires more steps to deploy. The Fargate launch type, in contrast, abstracts away the entire resource-provisioning process, reducing the deployment process to running just 6 or so commands. Crucially, only the EC2 launch type supports Docker volumes. [30]
We had initially wanted to support the Fargate launch type, in alignment with our design goal of being ‘simple to deploy’. However, it was clear to us that the ability to persist logs and configuration data beyond the lifecycle of individual containers will be important for any Apex user, and in the end we spent a significant amount of extra time configuring Apex to support the EC2 launch type.
Apex, as is, represents a single point of failure for a system. To protect it from being overwhelmed by bursty traffic, a best practice is to deploy a FIFO (‘first in, first out’) queue in front of apex-proxy. Another option we are considering is to deploy a standard queue without FIFO guarantees, which offers higher throughput rates, but requires additional middleware in apex-proxy to ensure messages are consumed in the right order.
While Docker volumes offer strong persistence guarantees for Apex’s logs and configuration data, the compute instances (e.g. AWS EC2) hosting the containers are ephemeral and not suitable for long-term storage. For users who need even stronger persistence guarantees, apex-logs-db and apex-config-store can be configured to periodically back up data to a cloud storage service (e.g. AWS S3 and S3 Glacier).
Currently, every incoming request to Apex triggers multiple reads from the Redis apex-config-store. As Apex is designed for small teams with a small, finite number of services, most of these reads will be for the same configuration data. We could therefore significantly reduce the read rate by caching a copy of all configuration data in memory within apex-proxy, and updating the cache whenever a user makes a change in Redis.
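One way this could look is an in-memory cache inside apex-proxy that is refreshed via Redis pub/sub whenever configuration changes (the channel name and cache shape are assumptions on our part):

```js
const { createClient } = require('redis');

const configCache = new Map();

async function startConfigCache() {
  const redis = createClient({ url: process.env.APEX_CONFIG_STORE_URL });
  const subscriber = redis.duplicate();
  await Promise.all([redis.connect(), subscriber.connect()]);

  // Serve reads from memory; only hit Redis on a cache miss
  async function getRule(key) {
    if (!configCache.has(key)) {
      configCache.set(key, await redis.hGetAll(key));
    }
    return configCache.get(key);
  }

  // When a user changes configuration, the admin API could publish the changed
  // key so that apex-proxy refreshes its copy immediately
  await subscriber.subscribe('config-updates', async (key) => {
    configCache.set(key, await redis.hGetAll(key));
  });

  return { getRule };
}

module.exports = { startConfigCache };
```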