Demystifying Log Collection in Cloud-Native Applications on Kubernetes
Episode #36: A DevOps Engineer’s Guide to Aggregating Distributed Logs with Kubernetes and Elastic tools.
This article continues our journey into Observability and logs.
So far, I have only examined logging from an application developer's perspective.
Here, we will analyse the problem from the perspective of a DevOps engineer who wants to aggregate logs from multiple distributed sources into a single log analysis tool, like Elasticsearch.
The main focus here is on "distributed."
If you were analysing a single log file from a local application, you wouldn't even need a "log collection" process. You could use command line tools to tail a log file and grep for what you need. I might consider writing an article on this topic, but it won't be our primary focus.
You can replace the word "distributed" with "microservices" or "cloud-native applications running on Kubernetes across many nodes".
Over two years in the Cloud Native Observability team at Elastic, the company behind Elasticsearch, have given me the experience to guide you through the complexities of log collection in distributed applications.
In this article, we will explain how Kubernetes gathers logs from a container and makes them available to possible log collection agents in an effective manner that mitigates some of the typical challenges.
The insights you'll gain from this article can be easily applied to various solutions, regardless of their relation to Kubernetes or Elastic Observability tools.
The sections of this article are the following:
Observability and logging
Log collection in distributed applications
Pitfalls of log collection
Log writing in Kubernetes
Nginx's logs in Kubernetes
Log collection in Kubernetes
Node logging agent architecture
Logging agents
Observability and logging
If you are new to Observability and logging, I have already written several articles on this topic in this newsletter.
In Exploring Structured Logging with Slog in Go, I have already introduced the concept of Observability as a modern version of monitoring.
I have also discussed how it's best to abandon print statements and move to proper Observability with logs, metrics, and traces in Observability 101: A Beginner's Journey Free of Print Statements.
Finally, in Master Observability with Logs: An In-Depth Guide for Beginners, I discussed how logging fits into the picture of Observability, how it compares to the other signals (metrics and traces), and how to get started with logging in Golang.
Log collection in distributed applications
What does log collection in distributed applications mean in this article?
The assumption here is that:
You have a non-trivial amount of logs constantly generated by a distributed application, distributed microservices, or many different applications running on a platform like Kubernetes that you want to analyse in a central location.
Let's unwrap the previous sentence:
distributed: logs are generated on many different machines.
non-trivial amount: more than just a few hundred megabytes.
continuously: you have a constant stream of data.
central location: this is the most important of the previous assumptions. Unless you centralise your logs, you won't be able to correlate them during a root cause analysis.
If even one of those assumptions is not valid in your use case, you might not need the solution described here.
If you have, for example, a few hundred megabytes of log files as part of a support case, you could use some CLI tools or ingest the logs as a one-off into an SQL database. There is no need for a complex log collection process.
Also, if you didn't need to correlate logs from many sources, you could store each application's logs on the node where they are generated. You could SSH into the node with the faulty application and inspect the logs where they live.
Pitfalls of log collection
Before discussing the clever pattern used by Kubernetes, I will briefly mention some of the problems with log collection.
Filling local disk
We have been writing logs into local files since application logging was a thing.
The biggest problem was implementing log rotation based on time or size to avoid filling the machine disk.
You could keep the last 30 days or 500MB of logs, depending on which limit was hit first.
Log rotation on local files has been a solved problem for many years, but it still requires the application or some logging agent to keep rotating and truncating the log files.
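As a concrete illustration, here is a minimal Go sketch of size- and age-based rotation using the widely used lumberjack package. The file path and limits are arbitrary examples mirroring the "500MB or 30 days" scenario above; the article does not prescribe this particular library.

package main

import (
    "log"

    lumberjack "gopkg.in/natefinch/lumberjack.v2" // rolling-file writer with rotation
)

func main() {
    // Route the standard logger through a rotating file writer.
    log.SetOutput(&lumberjack.Logger{
        Filename:   "/var/log/myapp/app.log", // hypothetical log location
        MaxSize:    500,                      // rotate the current file once it reaches 500 MB
        MaxAge:     30,                       // delete rotated files older than 30 days
        MaxBackups: 10,                       // keep at most 10 rotated files around
        Compress:   true,                     // gzip rotated files to save disk space
    })

    log.Println("application started")
}

Even with a library doing the rotation, the application now owns a responsibility (disk management) that has nothing to do with its business logic.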
Pushing logs to a log sink
It doesn't matter if your application sends logs directly to blob storage, a remote log analyser or any intermediary like a log collection tool.
The problem is still the same.
Your application needs to be "aware" of the log sink.
This means that you might need an application library to deliver log messages to that particular sink or some config management (or maybe just a config file) to store the location of the log sink.
What if the log analyser tool is temporarily unavailable? You will either lose logs or need to implement some in-memory buffer to account for this inevitable situation, and retry logic with an exponential back-off is advisable in this case.
You might want to change the location of the log sink, or the type of sink altogether, which might require a restart of your application.
All the previous requirements affect the application's performance, both in CPU (sending the logs) and in memory (temporarily storing them in a buffer).
To avoid blocking your application's main logic (which is definitely not advisable), you will also have to delegate sending logs to a separate thread so that it becomes an asynchronous operation.
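To make that complexity concrete, here is a minimal Go sketch of the machinery described above: a bounded in-memory buffer drained by a separate goroutine that retries with exponential back-off against a hypothetical sink endpoint. The sink URL, buffer size, and retry limits are illustrative assumptions, not part of any real library.

package main

import (
    "bytes"
    "fmt"
    "log"
    "net/http"
    "time"
)

// sinkURL is a hypothetical log-sink endpoint; in a real setup it would come
// from configuration, which is exactly the coupling described above.
const sinkURL = "http://logs.example.internal/ingest"

// asyncSender drains a bounded in-memory buffer on a separate goroutine,
// so the application's main logic never blocks on the network.
func asyncSender(buf <-chan string) {
    for line := range buf {
        backoff := 500 * time.Millisecond
        for attempt := 0; attempt < 5; attempt++ {
            resp, err := http.Post(sinkURL, "text/plain", bytes.NewBufferString(line))
            if err == nil && resp.StatusCode < 300 {
                resp.Body.Close()
                break // delivered
            }
            if err == nil {
                resp.Body.Close()
            }
            time.Sleep(backoff)
            backoff *= 2 // exponential back-off while the sink is unavailable
        }
        // after five failed attempts the line is lost: buffering only mitigates outages
    }
}

func main() {
    buf := make(chan string, 1024) // bounded buffer: memory cost paid by the application
    go asyncSender(buf)

    for i := 0; i < 3; i++ {
        select {
        case buf <- fmt.Sprintf("event %d", i):
        default:
            log.Println("buffer full, dropping log line") // the back-pressure decision is yours too
        }
    }
    time.Sleep(2 * time.Second) // crude: give the background sender a moment before exiting
}

None of this code has anything to do with what the application is actually supposed to do.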
As you can see, sending logs directly from your application to a log sink makes the application unnecessarily complicated.
3rd party apps
To make things worse, you might need to handle log collection from 3rd party software differently from applications where you have control over the source code.
You won't be able to push logs from 3rd party applications, so you will need to collect them first and then move them to the log sink.
Log writing in Kubernetes
There must be a better way to deal with log collection.
Kubernetes has adopted quite a clever trick to avoid some of the pitfalls presented above.
Kubernetes expects your application (or any 3rd party application) to write its logs to the standard output (stdout) and standard error (stderr) streams.
If you are not already familiar with what stdout and stderr are from Linux, you might want to understand those topics at Standard streams before moving forward with the rest of the section.
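In a Go application, for example, this means nothing more than pointing your loggers at os.Stdout and os.Stderr; a minimal illustration, with no file paths and no rotation logic:

package main

import (
    "log"
    "os"
)

func main() {
    // In a containerised application there is no file handling at all:
    // informational logs go to stdout, errors to stderr, and the container
    // runtime captures both streams for you.
    info := log.New(os.Stdout, "INFO: ", log.LstdFlags)
    errs := log.New(os.Stderr, "ERROR: ", log.LstdFlags)

    info.Println("listening on :8080")
    errs.Println("failed to reach the database, retrying")
}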
By default, Kubernetes redirects the content of these two streams for a running container to a log file under /var/log/containers/ (named after the pod, namespace, and container) on the node where the application is running.
At the same time, Kubernetes allows a user to access those same log files with the command:
kubectl logs <pod-name> -n <namespace>
You can even stream the content of the log files as they are written with the following command:
kubectl logs -f <pod-name> -n <namespace>
Thanks to the kubelet, Kubernetes also implements log rotation based on size, keeping each container's log files to a limited size and avoiding filling the local disk.
Nginx's logs in Kubernetes
What if, instead of writing to stdout/stderr, the application that you want to collect logs from writes to one or more local files (like, for example, access.log and error.log in the case of Nginx)?
In this case, the container image must be modified to redirect those files to stdout and stderr when running on Kubernetes.
Let's analyse how this is done in Nginx's Dockerfile.
ln -sf /dev/stdout /var/log/nginx/access.log
ln -sf /dev/stderr /var/log/nginx/error.log
As you can see, two symlinks are created to redirect access.log and error.log, respectively, to the stdout and stderr device files.
The above modifications are done so that Kubernetes can treat every application in the same way.
Log collection in Kubernetes
Kubernetes effectively separates the log writing step from the log collection step, solving most of the pitfalls of log collection that we analysed above.
The application writes logs to a file synchronously without any performance problems or a need to handle complex retry logic and in-memory buffers; then, logs can be collected and shipped asynchronously by another tool to a log sink.
Log collection can be delegated to any 3rd party tool, which can handle your own applications (the ones whose source code you control) in the same way it handles 3rd party applications.
As you can imagine, the benefits are enormous in terms of performance, application complexity and application size.
Moreover, logs don't have to be streamed at the same speed as they are generated. They can be batched for improved performance.
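As a rough sketch of what that batching could look like on the collection side (the batch size, flush interval, and flush step are all illustrative assumptions, not taken from any specific tool):

package main

import (
    "fmt"
    "time"
)

// flush stands in for whatever the collection tool does with a batch
// (compress it, ship it to Elasticsearch, write it to blob storage, ...).
func flush(batch []string) {
    if len(batch) == 0 {
        return
    }
    fmt.Printf("shipping batch of %d lines\n", len(batch))
}

// batcher decouples the rate at which lines are read from the rate at which
// they are shipped: a batch is flushed when it is full or when the ticker
// fires, whichever comes first.
func batcher(lines <-chan string, maxBatch int, interval time.Duration) {
    batch := make([]string, 0, maxBatch)
    ticker := time.NewTicker(interval)
    defer ticker.Stop()

    for {
        select {
        case line, ok := <-lines:
            if !ok {
                flush(batch) // ship whatever is left before shutting down
                return
            }
            batch = append(batch, line)
            if len(batch) >= maxBatch {
                flush(batch)
                batch = batch[:0]
            }
        case <-ticker.C:
            flush(batch)
            batch = batch[:0]
        }
    }
}

func main() {
    lines := make(chan string)
    go batcher(lines, 100, 2*time.Second)

    for i := 0; i < 250; i++ {
        lines <- fmt.Sprintf("log line %d", i)
    }
    close(lines)
    time.Sleep(time.Second) // let the final flush happen before exiting
}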
Any failure-handling technique we discussed above still needs to be implemented by the log collection tool, but that is no longer your application's problem.
There are different ways in which log collection can happen.
Node logging agent architecture
The most famous cluster-level logging architecture in Kubernetes is called node logging agent.
One pod per node runs separately from your application and has full access to the underlying Kubernetes node where it is running.
This pod needs full read access to the location on the Kubernetes node where container logs are stored (/var/log/containers/).
Given that you need one (and only one) log agent per node, this requirement can be easily implemented in Kubernetes as a DaemonSet.
The log collection agent will likely use some form of watermarking, recording on disk the last log event it delivered successfully. If the node agent fails and is restarted, it can resume from where it left off.
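Here is a minimal Go sketch of that watermarking idea, assuming a hypothetical registry file kept next to the agent; real agents such as Filebeat maintain a more elaborate registry, but the principle is the same: persist the byte offset of the last delivered event and seek past it on restart.

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
)

// readOffset loads the last committed byte offset for a log file; zero if the
// agent has never seen the file before. The registry file is an illustration,
// not the format of any specific agent.
func readOffset(registry string) int64 {
    data, err := os.ReadFile(registry)
    if err != nil {
        return 0
    }
    offset, _ := strconv.ParseInt(string(data), 10, 64)
    return offset
}

// commitOffset persists the watermark so that a restarted agent can resume
// from where it left off instead of re-shipping the whole file.
func commitOffset(registry string, offset int64) error {
    return os.WriteFile(registry, []byte(strconv.FormatInt(offset, 10)), 0o644)
}

func main() {
    const logPath = "/var/log/containers/example.log" // hypothetical container log
    const registry = "/var/lib/my-agent/example.offset"

    f, err := os.Open(logPath)
    if err != nil {
        fmt.Fprintln(os.Stderr, "open:", err)
        os.Exit(1)
    }
    defer f.Close()

    offset := readOffset(registry)
    if _, err := f.Seek(offset, 0); err != nil { // skip what was already delivered
        fmt.Fprintln(os.Stderr, "seek:", err)
        os.Exit(1)
    }

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        line := scanner.Text()
        // a real agent would ship the line to the log sink here
        offset += int64(len(line)) + 1 // +1 for the newline that Scan strips (assumes \n endings)
        _ = commitOffset(registry, offset)
    }
}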
Logging agents
Historically, Elastic has created or contributed to many tools to collect logs from different sources and dispatch them to Elasticsearch for storage and analysis.
In order of when those tools were developed:
Logstash, a server-side data processing pipeline that ingests, transforms, and ships data to various destinations.
Elastic Beats, lightweight data shippers in Golang that send data from edge devices to Elasticsearch or Logstash.
Elastic Agent, a single, unified way to add monitoring for logs, metrics, and other types of data to a host. Under the hood, it uses Elastic Beats but also provides extra features for host security.
The OpenTelemetry (OTel) Collector, a vendor-agnostic way to receive, process, and export telemetry data. Elastic did not originally write this tool but has contributed to it heavily in recent years.
The OTel Collector is likely the future of log collection, not just at Elastic but across the Observability field.
OpenTelemetry (OTel) is more broadly an open-source observability framework that provides vendor-agnostic APIs, libraries, agents, and instrumentation to generate, collect, process, and export telemetry data (traces, metrics, and logs) for analysis in observability backends.