Cost-Effective Log Management - Ideas Beyond Log Filtering

Raj Menon

1/4/2025 · 5 min read

In regulated industries such as finance, log management costs can be prohibitively high due to compliance requirements and industry standards. While numerous articles on this topic offer cost-effective log management strategies, they often focus on basic housekeeping activities such as identifying useful vs. redundant logs, implementing automated retention policies, and alerting for log spikes. This blog diverges from this approach, instead exploring novel data engineering techniques to tackle the issue of log explosion.

Cost Drivers:

Before exploring cost-saving measures, it's essential to understand the factors driving up logging costs. The two primary drivers are log volume and velocity. Log volume increases storage costs, while high velocity raises compute costs, particularly when systems perform in-line data enrichments such as geo-IP lookups and log categorization. This is crucial, as auditors often require timely log delivery for security incident response and for managing critical system availability in near real time. Furthermore, disaster recovery scenarios impose significant back-pressure management requirements on log data during pipeline disruptions, adding to the cost of messaging infrastructure such as Apache Kafka.

Third-party systems with consumption-based licensing models compound costs further when licenses are applied to ingested rather than stored logs; organizations that must constantly filter out undesirable content end up paying for data they never keep. Cloud provider costs are driven by data access patterns and by the storage tier where logs are persisted. Finally, the choice of detection/analytics tool also impacts costs, with options ranging from appliance-based licensing to software licenses tied to memory or CPU usage.

As you can see, there are myriad avenues through which one can easily bleed money trying to manage an enterprise logging solution. Now let's dive into how we can tame some of these cost drivers.

Log value & time-based data retention policies:

With the emergence of data lakehouse and warehouse technologies, logging can be approached as a multi-faceted strategy: configure multiple logging destinations for high- and low-value datasets, and store the same dataset across different timeframes. While doing so, you can still offer a unified log access experience to the end consumer, which we will cover in the Query Virtualization section. To manage this, route low-value logs - typically stored only for audit purposes - through minimal parsing to cost-effective storage layers like a data lakehouse. Similarly, archive data whose value diminishes with time (e.g., logs older than 14 or 30 days) to low-cost storage, ensuring durability without sacrificing availability. Note that "low-cost" doesn't imply an inferior solution; query response times can still be fast (seconds or minutes), thanks to proper data partitioning, indexing, and optimized compression algorithms like zstd.
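
To make this concrete, here is a minimal Python sketch of value- and age-based routing. The category names, destinations, and retention windows are illustrative assumptions, not a reference to any specific product's API.

from datetime import datetime, timedelta, timezone

# Illustrative routing table: category -> (destination, hot-retention window).
# The categories, destinations, and windows are assumptions for this sketch.
ROUTING = {
    "authentication": ("search-hot", timedelta(days=30)),   # high value, detection use cases
    "network":        ("search-hot", timedelta(days=14)),
    "audit":          ("lakehouse",  timedelta(days=0)),    # low value, parse minimally, store cheaply
}

def route(event: dict) -> str:
    """Pick a destination for an event based on its category and age."""
    destination, hot_window = ROUTING.get(event.get("category"), ("lakehouse", timedelta(0)))
    age = datetime.now(timezone.utc) - event["timestamp"]
    # Recent, high-value events go to the hot search tier; everything else
    # (and anything that has aged out) lands in low-cost lakehouse storage,
    # e.g. partitioned, zstd-compressed Parquet files.
    return destination if age <= hot_window else "lakehouse"

# Example usage
event = {"category": "authentication",
         "timestamp": datetime.now(timezone.utc) - timedelta(days=3)}
print(route(event))  # -> "search-hot"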

For high-value, frequently accessed, recent datasets (less than 14 or 30 days old) used in detection logic, store them on your preferred solution - such as Splunk, Elasticsearch, or OpenSearch. These solutions often offer storage tiering (hot, warm, cold, frozen), which improves efficiency. However, their compression ratios may still be limited, so the cost per unit of stored logs remains comparatively high. To overcome this, combine tiering strategies in the detection tool (e.g., Elasticsearch/OpenSearch) with a lakehouse solution, using each for different use cases (detection, audit, and analytics).
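
As an illustration of tiering, here is a hedged sketch of an index lifecycle policy using Elasticsearch ILM phases (hot/warm/cold/delete) via the elasticsearch-py 8.x client. The policy name, ages, and actions are assumptions chosen for the example, not recommended settings.

# A minimal sketch of a hot/warm/cold/delete lifecycle policy, assuming
# Elasticsearch ILM and the elasticsearch-py 8.x client. Names and ages
# are illustrative, not recommendations.
from elasticsearch import Elasticsearch

policy = {
    "phases": {
        "hot":    {"actions": {"rollover": {"max_size": "50gb", "max_age": "1d"}}},
        "warm":   {"min_age": "7d",  "actions": {"forcemerge": {"max_num_segments": 1}}},
        "cold":   {"min_age": "14d", "actions": {"set_priority": {"priority": 0}}},
        "delete": {"min_age": "30d", "actions": {"delete": {}}},
    }
}

es = Elasticsearch("https://localhost:9200")  # hypothetical endpoint
es.ilm.put_lifecycle(name="logs-tiering-policy", policy=policy)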

Query Virtualization & User Experience:

Before diving into additional log storage techniques, it's essential to consider user experience, particularly with differential data retention policies. One significant challenge is that analysts must maintain a complex mental map of which log source is persisted on which technology stack. This forces them to learn multiple access patterns - such as Splunk's SPL, Elasticsearch's Query DSL, and SQL - which can be overwhelming.

Query virtualization offers a solution, although it may not be especially efficient at this time. Some solutions, like OpenSearch Direct Queries, enable querying data stored in OpenSearch alongside data from other sources, such as AWS S3 and CloudWatch, without requiring analysts to switch between multiple analytics tools. Similarly, Elasticsearch plugins can be built to query data in Elasticsearch while also accessing Apache Iceberg tables. Trino, another query virtualization tool, can be used to access data using SQL across multiple RDBMS and NoSQL platforms. However, token-based searches using SQL, commonly employed by SREs and InfoSec analysts on unstructured log data, can lead to slow query response times. Data stored in a data lakehouse often doesn't lend itself well to these searches unless secondary indexing techniques on unstructured data are employed, which is beyond the scope of this article.
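
As a small illustration of federated access, here is a sketch using the Trino Python client to run one SQL query across two catalogs. The catalog, schema, and table names (iceberg.logs.firewall_archive, postgresql.cmdb.assets) and the connection details are purely hypothetical.

# A minimal sketch of a federated query via Trino, assuming the `trino`
# Python client and two pre-configured catalogs. All names are hypothetical.
from trino.dbapi import connect

conn = connect(host="trino.example.internal", port=8080,
               user="analyst", catalog="iceberg", schema="logs")
cur = conn.cursor()

# Join archived firewall logs in the lakehouse with an asset inventory
# living in a relational database, in a single SQL statement.
cur.execute("""
    SELECT f.src_ip, a.owner, count(*) AS hits
    FROM iceberg.logs.firewall_archive f
    JOIN postgresql.cmdb.assets a ON f.src_ip = a.ip_address
    WHERE f.event_date >= DATE '2024-12-01'
    GROUP BY f.src_ip, a.owner
    ORDER BY hits DESC
    LIMIT 20
""")
for row in cur.fetchall():
    print(row)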

Despite these challenges, vendors are recognizing the importance of managing log volume costs and are investing heavily in integration techniques that address these issues. As they continue to innovate, we can expect improved query virtualization capabilities, enabling analysts to access data from multiple sources more efficiently with a minimal toolkit.

Log Summarization:

In addition to query virtualization, log summarization is another technique that can help reduce the cost impact of log volume. Log data inherently contains repetitive information, making it an ideal candidate for summarization techniques. By applying log summarization strategies at various lifecycle stages, organizations can minimize storage costs while preserving valuable insights.

One such technique is intelligent log reduction, which can be achieved through tools like Logslash. This open-source project allows low-value fields to be ignored, storing counts of the records that make up an event instead of the individual records themselves. Additionally, as described on its GitHub page, denormalization techniques can be employed to preserve these low-value fields.
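
To illustrate the general idea (this is not Logslash's implementation, just a toy sketch), the snippet below collapses similar events within a time window into a single record with a count, dropping fields assumed to be low value.

# Toy sketch of count-based log reduction: events that share the same
# "high-value" key within a window collapse into one record with a count.
# Field names and the window size are assumptions for illustration.
from collections import Counter

HIGH_VALUE_FIELDS = ("host", "event_action", "user")  # fields worth keeping

def reduce_window(events: list[dict]) -> list[dict]:
    """Collapse a window of raw events into counted summary records."""
    counts = Counter(
        tuple(e.get(f) for f in HIGH_VALUE_FIELDS) for e in events
    )
    return [
        dict(zip(HIGH_VALUE_FIELDS, key)) | {"count": n}
        for key, n in counts.items()
    ]

window = [
    {"host": "web01", "event_action": "login_failed", "user": "alice", "pid": 101},
    {"host": "web01", "event_action": "login_failed", "user": "alice", "pid": 102},
    {"host": "web02", "event_action": "login_ok",     "user": "bob",   "pid": 200},
]
print(reduce_window(window))
# [{'host': 'web01', 'event_action': 'login_failed', 'user': 'alice', 'count': 2},
#  {'host': 'web02', 'event_action': 'login_ok', 'user': 'bob', 'count': 1}]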

Another log summarization technique is index rollups, commonly used in solutions like Elasticsearch. This approach aggregates and consolidates data after it has been persisted, making it easier to manage historical log data that is highly granular in nature. For example, if log events are recorded every second of the day and each individual document carries minimal value, an index rollup strategy can aggregate those records on an hourly or daily basis, preserving counts and important fields for historical analysis.
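
For illustration, a rollup job configuration along these lines might look like the sketch below. The index patterns, field names, and schedule are assumptions; the exact job schema and API depend on whether you are using Elasticsearch rollups or OpenSearch index rollups.

# Sketch of an hourly rollup job configuration (field names and patterns are
# illustrative). In Elasticsearch this shape of job is submitted to the
# rollup job API (PUT _rollup/job/<id>); OpenSearch's index rollup feature
# uses a similar but not identical schema.
rollup_job = {
    "index_pattern": "logs-firewall-*",      # granular, per-second source indices
    "rollup_index": "logs-firewall-rollup",  # compact hourly summaries
    "cron": "0 0 * * * ?",                   # run at the top of every hour
    "page_size": 1000,
    "groups": {
        "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"},
        "terms": {"fields": ["src_ip", "action"]},
    },
    "metrics": [
        {"field": "bytes", "metrics": ["sum", "max"]},
    ],
}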

By implementing log summarization techniques like Logslash and Index rollups, organizations can effectively reduce log volume costs while maintaining valuable insights into their data.

Allow vs. Deny List filtering:

When it comes to log filtering, there are many articles available that provide guidance on determining what is valuable from an information security or site reliability engineering perspective. However, I'd like to emphasize the importance of adopting an allow list-based approach over a deny list-based approach. As system owners, we often have a good understanding of which logs can be discarded (e.g., error, malformed logs), but identifying the logs that are actually required for storage and analysis can be challenging.

This is where a taxonomy exercise comes in - a crucial step in log classification that allows us to identify and categorize events of interest. Through this exercise, we can sort logs into categories such as file operations, authentication, network/firewall, DNS, and so on. This classification serves two purposes: it enables building an efficient log ingestion mechanism based on the allow-list principle, and it provides a solid foundation for audit responses.
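
As a sketch of the allow-list principle in practice (the category names and matching rules below are assumptions, not a standard), the snippet forwards only events that map to a known taxonomy category and drops everything else by default.

# Minimal allow-list ingestion sketch: only events whose source maps to a
# taxonomy category are kept; everything else is dropped by default.
# The taxonomy entries here are illustrative, not a recommended standard.
TAXONOMY = {
    "authentication": {"sshd", "winlogon", "okta"},
    "dns":            {"bind", "coredns"},
    "network":        {"pan_firewall", "netflow"},
}

def classify(event: dict) -> str | None:
    """Return the taxonomy category for an event, or None if unmatched."""
    source = event.get("source", "").lower()
    for category, sources in TAXONOMY.items():
        if source in sources:
            return category
    return None

def ingest(events):
    """Yield only allow-listed events, tagged with their category."""
    for event in events:
        category = classify(event)
        if category is not None:            # allow-list: drop by default
            yield {**event, "category": category}

sample = [{"source": "sshd", "msg": "Failed password"},
          {"source": "debug_tool", "msg": "verbose trace"}]
print(list(ingest(sample)))   # only the sshd event survives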

Audit responsiveness is particularly important, as auditors typically seek assurance that organizations are collecting all the security events needed to maintain their security and availability posture. In today's highly variable endpoint ecosystem, with multiple InfoSec tools and their varying log formats, standardization becomes increasingly difficult. By adopting a taxonomy-driven approach, we can mitigate this challenge and provide auditors with the assurance they need.

While taxonomy exercises can be time-consuming and challenging, it's best to tackle them in pieces, starting with the noisiest and most voluminous log sources in our ecosystem. For example, we can start with OS logs, firewall logs, Kubernetes logs, or any other log source that is particularly complex or difficult to manage.

By taking a systematic approach to taxonomy and log classification, we can create a robust and efficient logging solution that meets our organization's needs while also supporting audit requirements.

Conclusion:

Managing logs in an ever-growing, complex, and highly distributed systems environment is a tough business. My goal here was not to trivialize that problem by offering a set of axioms, but to describe techniques and issues that can help manage it a little better. If you have additional tips or techniques that can help tame this beast, please feel free to share your thoughts below!

Get in touch

If you have any questions or comments on this topic, feel free to get in touch at info@b4bytes.com.