Cybersecurity Lakehouses Best Practices Part 4: Data Normalization Strategies


In this four-part blog series, “Lessons learned from building Cybersecurity Lakehouses,” we’re discussing a number of challenges organizations face with data engineering when building out a Lakehouse for cybersecurity data, and offering some solutions, tips, tricks, and best practices that we’ve used in the field to overcome them.

In part one, we began with uniform event timestamp extraction. In part two, we looked at how to spot and handle delays in log ingestion. And in part three, we tackled how to parse semi-structured, machine-generated data. In this final part of the series, we discuss one of the most important aspects of cyber analytics: data normalization using a common information model.

By the end of this blog, you will have a solid understanding of some of the issues faced when normalizing data into a Cybersecurity Lakehouse and the techniques we can use to overcome them.

What is a Common Information Model (CIM)?

A Common Information Model (CIM) is required for cybersecurity analytics engines to facilitate effective communication, interoperability, and understanding of security-related data and events across disparate systems, applications, and devices within an organization.

Organizations have different systems and applications that generate logs and events in various structures and formats. A CIM provides a standardized model that defines common data structures, attributes, and relationships. This standardization allows analytics engines to normalize and harmonize data collected from disparate sources, making it easier to process, analyze, and correlate information effectively.

Why use a Common Information Model?

Organizations use a variety of security tools, applications, and devices from different vendors, which generate logs specific to their respective technologies. Normalizing data into a known set of structures with consistent and understandable naming conventions is crucial to enable data correlation, threat detection, and incident response capabilities.

As a working example, suppose we wanted to know which systems and applications user ‘Joe’ has successfully authenticated against within the last 30 days.

To answer this question without a single model to interrogate, an analyst would be required to craft queries to search tens or hundreds of logs. Each log file reports the username and the authentication result (success or failure) under different field names with different values. The application field name could also be different, as could the event time. This is not a workable solution. Enter the Common Information Model and the normalization process!

[Figure: Common Information Model]

The image above shows how disparate logs from many sources filter events into event-specific tables, using known column names, allowing a single simple query to answer the question once data has been normalized.
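As a rough illustration of that single query, here is a minimal Python sketch. It assumes events have already been normalized into one table, and it uses the standard library's sqlite3 module purely as a stand-in for the Lakehouse query engine. The table name `authentication` and the columns `event_time`, `username`, `app`, and `action` are hypothetical and not taken from any specific CIM.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Fixed "now" so the example is reproducible.
NOW = datetime(2023, 8, 1, tzinfo=timezone.utc)

# Hypothetical, already-normalized authentication events from several sources.
EVENTS = [
    ("2023-07-30T09:15:00", "joe", "vpn", "Success"),
    ("2023-07-29T18:02:00", "joe", "email", "Failure"),      # failed login
    ("2023-07-20T07:45:00", "joe", "hr-portal", "Success"),
    ("2023-05-01T12:00:00", "joe", "wiki", "Success"),       # older than 30 days
    ("2023-07-31T10:00:00", "ann", "vpn", "Success"),        # different user
]

def apps_authenticated(conn, username, now, days=30):
    """Return the apps a user successfully authenticated against in the window."""
    cutoff = (now - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%S")
    rows = conn.execute(
        "SELECT DISTINCT app FROM authentication "
        "WHERE username = ? AND action = 'Success' AND event_time >= ? "
        "ORDER BY app",
        (username, cutoff),
    ).fetchall()
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE authentication "
    "(event_time TEXT, username TEXT, app TEXT, action TEXT)"
)
conn.executemany("INSERT INTO authentication VALUES (?, ?, ?, ?)", EVENTS)

result = apps_authenticated(conn, "joe", NOW)
print(result)
```

Because every source shares the same column names and the same literal values for `action`, the analyst writes one query instead of one per log type.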

Things to consider when normalizing data

There are a number of conditions that should be accounted for when normalizing disparate data sources into a single CIM-compliant table:

Differing Column Types: Unifying disparate data sources and specific events into a CIM (event-driven) table may surface clashing data types.

Derived Fields: The normalization process often requires new fields to be derived from one or more source columns.

Missing Fields: Fields may unexpectedly not exist or may contain null values. Ensure the CIM caters for missing or null values.

Literal Fields: Data to support a target CIM field may need to be created, or the field may need to be set to a literal value such as “Success” or “Failure” to ensure a unified search capability, for example (where action="Success").

Schema Evolution: Both data and the CIM may evolve over time. Ensure you have a mechanism to provide backward compatibility, especially within the CIM tables, to cater for changes in data.

Enrichment: CIM data is often enriched with other context such as threat data and asset information. Consider how to add this information to provide a comprehensive view of the events collected.
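The considerations above can be sketched in a single normalization function. This is only an illustration in plain Python under stated assumptions: the raw VPN record layout (`ts`, `login`, `domain`, `src`, `port`, `result`), the CIM column names, and the `ASSET_OWNERS` lookup are all hypothetical, and a real pipeline would apply the same ideas in its own processing engine.

```python
from datetime import datetime, timezone

# Illustrative CIM columns; "asset_owner" stands in for a column
# added in a later version of the model.
CIM_COLUMNS = ("event_time", "username", "src_host", "src_port",
               "app", "action", "asset_owner")

# Hypothetical asset inventory used for enrichment.
ASSET_OWNERS = {"10.0.0.5": "finance"}

def to_int(value):
    """Cast to int, returning None when the value is missing or not numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def normalize_vpn(raw):
    """Map a hypothetical raw VPN record onto the illustrative CIM columns."""
    record = {
        # Differing column types: epoch seconds as text -> ISO 8601 UTC string.
        "event_time": datetime.fromtimestamp(
            int(raw["ts"]), tz=timezone.utc
        ).isoformat(),
        # Derived field: built from two source columns.
        "username": f"{raw['login']}@{raw['domain']}".lower(),
        "src_host": raw.get("src"),
        # Missing field: an absent or malformed port becomes None.
        "src_port": to_int(raw.get("port")),
        # Literal fields: source codes collapse to "Success"/"Failure",
        # and the application is a constant for this source.
        "action": "Success" if raw.get("result") == "OK" else "Failure",
        "app": "vpn",
    }
    # Enrichment: add asset context when the source host is known.
    record["asset_owner"] = ASSET_OWNERS.get(record["src_host"])
    # Schema evolution: always emit every CIM column, defaulting to None,
    # so the output shape stays stable as columns are added.
    return {column: record.get(column) for column in CIM_COLUMNS}

raw_event = {"ts": "1690000000", "login": "Joe", "domain": "CORP",
             "src": "10.0.0.5", "result": "OK"}
row = normalize_vpn(raw_event)
print(row)
```

Emitting every CIM column with a null default is one simple way to keep older and newer records compatible; it is a design choice for this sketch, not a requirement of any particular model.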

Which model should I choose?

There are many common information models to choose from when building out a Cybersecurity Lakehouse, from open source models to vendor-specific, publicly available models. The decision on what to use depends mainly on your individual use case.

Some considerations are:

  • Are you augmenting Delta Lake with another SIEM or SOAR product? Does it make sense to adopt that product’s model for easier integration?
  • Are you only building a Cybersecurity Lakehouse for a specific use case? For instance, do you only want to analyze Microsoft endpoint data? If so, does it make sense to align with the Microsoft ASIM model?
  • Are you building out a Lakehouse as your organization’s main cyber analytics platform? Does it make sense to align with an open source model like OCSF or OSSEM, or to build your own?

Ultimately, the choice is organization-specific, depending on your needs. Another consideration is the completeness of the model you choose. Models are generic and will likely require some adaptation to fit your needs; however, they should largely support your data and requirements before you begin adopting the model, as model changes after the fact are time-consuming.

Tips and best practices

Regardless of the model you choose, there are a few tips to ensure gaps don’t exist in your overall security posture.

  • Most queries rely heavily on entities. Source host, destination host, source user, and application used are likely the most searched-for columns in any table. Ensure these are well-mapped and normalized.
  • Models often provide guidance on field coverage (mandatory, recommended, optional). Ensure, at a minimum, that mandatory fields are mapped and have data integrity checks applied for a consistent search environment.
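A data integrity check on mandatory fields could look something like the following Python sketch. The coverage levels assigned to each field and the allowed `action` values are hypothetical; in practice they would come from the documentation of whichever model you adopt.

```python
# Hypothetical field coverage for an authentication table.
FIELD_COVERAGE = {
    "event_time": "mandatory",
    "username": "mandatory",
    "action": "mandatory",
    "app": "recommended",
    "src_host": "recommended",
    "src_port": "optional",
}

# Literal values assumed to be valid for the "action" column.
ALLOWED_ACTIONS = {"Success", "Failure"}

def integrity_issues(record):
    """Return a list of human-readable problems found in one normalized record."""
    issues = []
    for field, level in FIELD_COVERAGE.items():
        # A mandatory field must be present and non-empty.
        if level == "mandatory" and record.get(field) in (None, ""):
            issues.append(f"missing mandatory field: {field}")
    action = record.get("action")
    # A populated action must use one of the agreed literal values.
    if action not in (None, "") and action not in ALLOWED_ACTIONS:
        issues.append(f"unexpected action value: {action}")
    return issues

good = {"event_time": "2023-07-30T09:15:00", "username": "joe",
        "action": "Success", "app": "vpn"}
bad = {"event_time": "2023-07-30T09:15:00", "username": "",
       "action": "OK"}

print(integrity_issues(good))
print(integrity_issues(bad))
```

Records that return a non-empty list could be quarantined or flagged rather than silently written to the CIM table, which keeps searches on entity columns consistent.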


Common Information Model-based tables are a cornerstone of an effective cyber analytics platform. The model you adopt when building out a Cybersecurity Lakehouse is organization-specific, but any model should largely be suitable for your organization’s needs before you begin. Databricks has previously solved this problem for customers using the principles outlined in this blog.

Get in Touch

If you want to learn more about how Databricks cyber solutions can empower your organization to identify and mitigate cyber threats, contact [email protected] and check out our Lakehouse for Cybersecurity Applications webpage.