Chaos Testing in Log Management – A Data Engineering Challenge

When it comes to log collection, no matter the use case (Security Information and Event Management, SIEM, or Site Reliability Engineering, SRE), there is growing recognition that it needs to be treated as a data concern. With that out of the way, let’s dive into the use case.

My team and I were tasked with transforming the log collection framework for our organization. To set some context, we are talking about a system that has to handle over a million EPS (events per second), with terabytes of data flowing in each day. The system also had to hydrate not only the primary log collection store but also the data lakehouse built around it. The source systems were varied as well: many predominantly sent data via Syslog over TCP or UDP, others used API-based ingestion mechanisms, and still others supplied logs via message brokers. The schemas were also unpredictable for many of these source systems. Finally, to ensure business continuity, we had to keep operating the current log collection mechanism, with an identical profile, until we switched over. This was a multi-year journey, with the volume parameters of our design changing every few weeks or months until we finished our transformation.

Rolling back to the legacy infrastructure was practically impossible. The ability to roll back is one of the cardinal rules of change management, and unfortunately we knew it would be hard to honor. A few reasons why that was the case:

  • The investments to embark on this change journey ran into millions of dollars’ worth of software, hardware, and human resource time.

  • The current infrastructure was being continuously hammered by a growing volume and variety of log integrations; it was only a matter of time before it keeled over.

  • Finally, and most importantly, log accessibility and resiliency were simply not baked into our legacy solution for our end consumers.

Not being able to move to our modern system would have affected operations in a way that was simply too risky and unacceptable to our business partners, and would have caused catastrophic failures in the long run. It is this last factor that made it imperative that we transition to the new system smoothly and without major issues.

So how does chaos testing help us attain our goals? The idea of chaos testing is more established in the software engineering development process than in the big data management discipline; in fact, the notion of testing itself is still developing in the big data landscape. I will be candid and admit that we could not follow all the principles of chaos engineering in our log collection framework, but our experiments did reflect what one might require to build confidence in a future state system and to ensure successful delivery of the end product.

I would like to walk through some of those principles and their application in our journey:

A) Build a hypothesis around Steady State Behavior:

Here is where we defined the successful operation of our system based on our customer expectations and requirements. In log management, one of the metrics that defines steady state is the latency of log sources. While we collect over a hundred log sources, we identified a set of them that were critical for our customers’ operations. The latency budget for these log sources was itself determined by the detection frequency, the time duration between alerts triggered from the data in these sources. Another metric, data durability, was also an important concern we had to address. As I mentioned earlier, we receive data over both TCP and UDP, so it was a given that there would be a certain amount of log loss in the system. The goal for us was to keep this log loss within a certain fraction of the total log volume we collect, and this parameter can also be broken down per individual source feeding the system. There were other interesting metrics that we considered as well, e.g. attribution of logs to a particular source, but we will stick to the two metrics described above for the purposes of this write-up.
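
To make the steady-state hypothesis concrete, here is a minimal sketch of how these two metrics could be expressed as checks. The source names, latency budgets, and the 0.1% loss threshold are illustrative, not our actual figures; in practice the thresholds come from your customers’ detection frequency and durability requirements.

```python
from datetime import datetime, timedelta

# Illustrative steady-state thresholds (assumed values, not the real ones).
MAX_LATENCY = {                      # per-source latency budget (event emitted -> searchable)
    "firewall": timedelta(minutes=5),
    "proxy": timedelta(minutes=15),
}
DEFAULT_LATENCY = timedelta(minutes=30)
MAX_LOSS_RATIO = 0.001               # tolerate at most 0.1% log loss per source


def latency_within_budget(source: str, event_time: datetime, indexed_time: datetime) -> bool:
    """True if a log became searchable within the latency budget for its source."""
    return (indexed_time - event_time) <= MAX_LATENCY.get(source, DEFAULT_LATENCY)


def durability_within_budget(events_sent: int, events_indexed: int) -> bool:
    """True if the observed loss ratio for a source stays under the agreed threshold."""
    if events_sent == 0:
        return True
    loss_ratio = (events_sent - events_indexed) / events_sent
    return loss_ratio <= MAX_LOSS_RATIO


# Example: a source sent 10,000,000 events and 9,996,500 landed in the index;
# 0.035% loss is within the 0.1% budget.
print(durability_within_budget(10_000_000, 9_996_500))  # True
```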

B) Vary Real-world Events:

The most common failures in our environment, and I would extrapolate this to any log collection system, are increases in log latency or issues with data durability caused by unpredictable spikes in log volume or a breakdown of the log collection mechanism. The ultimate reflection of these issues is missed alerts within the SLAs defined for our customers. Anecdotally, in information security the value of data or logs decreases exponentially with time. We defined the maximum amount of time our customers could function without the availability of logs, and also identified compensatory tools that could be used while the data was unavailable.

One of the ways we experimented with log latency was by varying the amount of compute and storage resources available to the log (or data) processing framework and identifying the breaking point where the latency of the logs becomes unacceptable. Here, the data processing framework can be any data transformation mechanism you use for parsing, enriching, or curating (e.g. schema mapping) your log data. Another way to simulate log latency was to stop the data processing framework itself for a set amount of time. Let’s assume your customers’ recovery time objective is 24 hours. You can halt your data processing framework for, say, 12 hours and see whether your pipeline can catch up within the 24-hour window. This means recovering not only the backlogged data that accumulated during the 12-hour outage but also keeping up with the data flowing in while you recover the backlog. You can design the experiment as you see fit; what I described is just an example of the experiments we could run.
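
As a rough capacity check before running such a catch-up experiment, the arithmetic is simple: the backlog drains only as fast as your processing headroom over the live ingest rate. The EPS figures below are illustrative, not our production numbers.

```python
def catch_up_hours(ingest_eps: float, processing_eps: float, outage_hours: float) -> float:
    """Hours needed to drain the backlog from an outage while live traffic keeps flowing.

    Assumes a roughly constant ingest rate; if processing capacity does not exceed
    ingest, the backlog never drains.
    """
    headroom_eps = processing_eps - ingest_eps
    if headroom_eps <= 0:
        return float("inf")
    backlog_events = ingest_eps * outage_hours * 3600
    return backlog_events / (headroom_eps * 3600)


def meets_rto(ingest_eps: float, processing_eps: float, outage_hours: float, rto_hours: float) -> bool:
    """True if the outage duration plus the catch-up time fits inside the recovery time objective."""
    return outage_hours + catch_up_hours(ingest_eps, processing_eps, outage_hours) <= rto_hours


# Example: 1M EPS steady ingest, 2M EPS processing capacity, 12-hour halt, 24-hour RTO.
# A 12-hour backlog drains in 12 hours of headroom, exactly hitting the 24-hour window.
print(meets_rto(1_000_000, 2_000_000, 12, 24))  # True
```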

Designing experiments for data durability is even trickier, simply because it is difficult to pin down the entry points for your logs and how much control you have over them. For example, when you act as a central log collection team, collecting logs from a server that belongs to a different SRE organization becomes interesting because that server is simply not within your control. If that server is chatty and its logs rotate faster than you can collect them, you immediately experience data loss. You can always negotiate local log retention policies with those SRE teams, but what we will focus on today is data durability issues that occur after the data or logs hit your collectors, which is very much within your control. In order to maintain line speed, our systems relied on in-memory queues to pass data between the various pipeline stages. Here we would deliberately crash pipeline components to see whether there was any form of resiliency (i.e. back-pressure management) to recover lost data, and to measure the proportion of data lost. Once we recognized that data loss scenarios were beyond acceptable thresholds, we looked for alternatives such as disk-based queues to manage data processing.
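
A minimal sketch of such a crash-injection harness is shown below. The collector address, the "log-processor" service name, and the use of systemctl are placeholders for whatever your stack actually runs, and the received count would come from querying your downstream log store after the dust settles.

```python
import socket
import subprocess
import time

COLLECTOR_ADDR = ("collector.example.internal", 514)   # assumed TCP syslog listener
PIPELINE_SERVICE = "log-processor"                     # assumed mid-pipeline service to crash
TOTAL_EVENTS = 100_000


def run_crash_experiment() -> None:
    """Stream uniquely numbered test events, then hard-kill a pipeline stage mid-stream."""
    with socket.create_connection(COLLECTOR_ADDR) as sock:
        for seq in range(TOTAL_EVENTS):
            sock.sendall(f"<134>chaos-durability-test seq={seq}\n".encode())
            if seq == TOTAL_EVENTS // 2:
                # Inject the failure halfway through (placeholder command for your environment).
                subprocess.run(
                    ["systemctl", "kill", "--signal=SIGKILL", PIPELINE_SERVICE],
                    check=False,
                )
    # Give back-pressure, retries, or a disk-backed queue time to flush before measuring.
    time.sleep(300)


def evaluate(received_count: int, max_loss_ratio: float = 0.001) -> bool:
    """Compare what landed in the log store (queried out of band) against what was sent."""
    loss_ratio = (TOTAL_EVENTS - received_count) / TOTAL_EVENTS
    print(f"lost {loss_ratio:.4%} of test events")
    return loss_ratio <= max_loss_ratio
```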

One other important consideration while running these experiments was to make sure that we recovered not just the data but also the alerts configured on top of it. This was easily handled by keying the alerts on the date the logs were collected by our system rather than on the actual event occurrence date. This topic deserves a longer discussion in its own right, but for the sake of brevity we will stop here.
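
As a small illustration of that tuning (the timestamps and the one-hour window are made up), keying detection on the ingest timestamp rather than the event timestamp is what lets backlogged events still fire alerts once they finally land:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogEvent:
    event_time: datetime    # when the event actually happened at the source
    ingest_time: datetime   # when our pipeline finally collected and indexed it
    message: str


def in_alert_window(event: LogEvent, now: datetime, window: timedelta = timedelta(hours=1)) -> bool:
    """Evaluate the alert window against ingest_time, not event_time.

    Events recovered from a backlog carry old event_time values; keying on
    ingest_time ensures they still pass through detection once they arrive.
    """
    return now - event.ingest_time <= window


now = datetime(2024, 1, 2, 12, 0)
backlogged = LogEvent(
    event_time=now - timedelta(hours=14),      # happened during the simulated outage
    ingest_time=now - timedelta(minutes=5),    # only just recovered into the store
    message="failed login",
)
print(in_alert_window(backlogged, now))  # True: still evaluated despite the 14-hour-old event_time
```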

C) Run Experiments in Production:

One thing going for us in our transformation journey was that we were building a brand new system. This allowed us to construct a production environment that, while active, was not yet operational. While this situation does not completely satisfy the chaos engineering principle of ongoing production experimentation, we will revisit that at the end of this section. It did, however, create an interesting challenge: we could not simply launch the new system into production and decommission the old one simultaneously, for various operational and audit reasons. For one, we did not know whether the new system could handle our production load in practice, even though our MVP implementation and back-of-the-envelope calculations indicated it could. Some degree of risk management had to be baked into the deployment process, even though, as I mentioned, change management was going to be a one-way street. So we decided to go for a “soft launch” of our new production environment. The idea was to mirror the exact production workload into the new system and iron out issues before we actually went live. This definitely put some strain on managing two replica systems (the legacy and the modern solution) while we ran the experiments described above, gathered customer feedback, and so on. It also let us contrast the performance of the two systems, both operationally and functionally.

To manage ongoing experimentation with production workloads, we created a miniature replica of the production environment. While it does not share the exact infrastructure profile of production, we are able to duplicate production traffic into it. This by no means represents a “blue-green” deployment model, nor a classic dev-test-production model, but it still allows us to design chaos experiments against production data patterns for a few days’ worth of logs. From a risk standpoint, this was an acceptable trade-off for us.
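
For illustration only, a minimal traffic “tee” for UDP syslog might look like the sketch below. The host names and ports are placeholders, and in practice the duplication usually lives in the load balancer or the collection agents rather than a standalone script.

```python
import socket

LISTEN_ADDR = ("0.0.0.0", 514)
DESTINATIONS = [
    ("prod-collector.example.internal", 514),    # the live production collector
    ("chaos-replica.example.internal", 514),     # the scaled-down experimentation replica
]


def run_tee() -> None:
    """Forward every received syslog datagram, unchanged, to both environments."""
    recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    recv_sock.bind(LISTEN_ADDR)
    send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        datagram, _src = recv_sock.recvfrom(65535)
        for dest in DESTINATIONS:
            send_sock.sendto(datagram, dest)


if __name__ == "__main__":
    run_tee()
```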

D) Minimize Blast Radius:

In software engineering, microservices help minimize blast radius by supporting classic design philosophies such as high cohesion and low coupling. In data management, especially log management, we can either treat all logs pouring into the system as one giant monolith segregated only by a “source” field, or shred the log data by source and build lifecycle management policies for each individual source. By way of an example, imagine dumping your operating system, firewall, and proxy logs into one giant index or table, versus creating individual indices for each of these sources. Such segregation at the data management layer allows one to build not only the storage component (i.e. the index or table) as its own service, but also the underlying processing layers (i.e. the ETL components, message broker components, etc.) as their own services catering to that functional unit of data. This, by extension, allows us to design chaos experiments that affect a tiny subset of the ecosystem rather than its entirety.
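
A simple way to picture this segregation is a per-source registry in which each source owns its own topic, index, and parser; the names below are purely illustrative. A chaos experiment scoped to one source then touches only that source’s components.

```python
# Illustrative per-source pipeline registry: each log source gets its own
# queue/topic, index, and parsing pipeline, so an experiment (or a failure)
# touches only that slice of the ecosystem.
PIPELINES = {
    "firewall": {"topic": "logs.firewall", "index": "logs-firewall", "parser": "parse_firewall"},
    "proxy":    {"topic": "logs.proxy",    "index": "logs-proxy",    "parser": "parse_proxy"},
    "os":       {"topic": "logs.os",       "index": "logs-os",       "parser": "parse_os"},
}


def blast_radius(experiment_sources: list[str]) -> list[str]:
    """Components an experiment can touch: only the topics, indices, and parsers of the chosen sources."""
    affected = []
    for source in experiment_sources:
        cfg = PIPELINES[source]
        affected.extend([cfg["topic"], cfg["index"], cfg["parser"]])
    return affected


print(blast_radius(["proxy"]))  # the experiment never touches the firewall or OS components
```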

The approaches described in sections C and D give us some cover to run chaos experiments on data engineering problems while minimizing negative customer impact.

Conclusion:

Remember, log collection, whether for SIEM or SRE, is a data management concern. Data management does not lend itself to exactly the same patterns as web or middleware application development, but we can borrow practices and design philosophies from those work streams and apply them to the big data landscape of log management, data lakehouse applications, you name it. The builders of data pipelines will always have a bias toward testing systems without disturbing the steady state of their business. Relying on monitoring alone to detect failures only lets you catch the symptoms of data problems. To build confidence in the resiliency and quality of the systems we deliver to our customers, adopting chaos engineering principles is a great addition to the development practice. Log management is a very mercurial landscape, with the data profile (volume, velocity, variety, etc.) changing rapidly due to the introduction of new workloads, products, and additional systems. Hence, running chaos experiments is not a one-time exercise but warrants continuous execution.
