An Event-Driven, Edge-Native Model for Managing Security Compliance Posture

"Autonomy is the foundation of creativity and efficiency in decision making"

One major reason for the success of some modern applications and orchestrators such as Kubernetes is that they build on event-driven design and combine it with the benefits of edge computing. In simple terms, they do not rely heavily on request-response message exchange between a controller and edge devices (nodes), a model in which the controller alone is responsible for managing the devices, including pushing their configurations.

Instead, in an event-driven architecture, an event (generated by any of various sources) can trigger actions on remote edge devices. The edge devices make this possible by monitoring event sources that are external to the device. We may call this an "edge-native" way to manage the state of the devices/nodes.
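To make the idea concrete, here is a minimal sketch of an edge node that reacts to events locally instead of waiting for a controller to push changes. The `EdgeNode` class, the event types, and the file path are all hypothetical names chosen for illustration; a real deployment would receive events from an actual source such as a file watcher or a message bus.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EdgeNode:
    """Hypothetical edge node that reacts to events on its own."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.handlers: Dict[str, List[Handler]] = defaultdict(list)
        self.actions: List[str] = []  # record of what the node did locally

    def on(self, event_type: str, handler: Handler) -> None:
        """Register a local handler for one type of event."""
        self.handlers[event_type].append(handler)

    def notify(self, event: dict) -> None:
        """Called by an event source; runs the handlers for this event type."""
        for handler in self.handlers.get(event["type"], []):
            handler(event)


node = EdgeNode("edge-01")
node.on(
    "config_changed",
    lambda event: node.actions.append(f"re-apply baseline to {event['path']}"),
)

node.notify({"type": "config_changed", "path": "/etc/ssh/sshd_config"})
node.notify({"type": "heartbeat"})  # no handler registered, so nothing happens

print(node.actions)
```

The point of the sketch is that the response logic lives on the node: no controller has to be reachable for the handler to run.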

The advantages of this approach are:

  • This inherently makes the architecture more modular, and hence more scalable and secure.

  • The edge devices are autonomous. Issues such as a broken communication link between the controller and the edge devices do not interrupt an edge device's state management.

  • In cases like IoT, edge computing lowers network traffic and connectivity cost. Instead of sending large volumes of data to the cloud, only relevant information is sent over the network.

  • Security configuration drift on the edge devices can be fixed in a self-healing way (instead of a controller pushing the changes periodically).

There are some interesting advances in the way security posture or compliance is ensured on edge devices. Examples include Event-Driven Ansible (EDA) and cloud posture management solutions from various vendors.

Traditional way:

The traditional approach to security or posture management is to have a controller host push configuration changes to the end devices. This could be an Ansible or Puppet host. The limitations of this approach are:

  • When an edge device drifts from the standard configuration, there is a time lag before the drift is fixed. This is because the configuration is pushed periodically by a scheduler (for example, a cron job on GNU/Linux systems).

  • Inefficiencies grow as the number of assets (and networks) increases, due to additional overheads such as making sure that all controllers are in sync and can reach the nodes at all times.

  • Fixing issues takes more time (high MTTR).

  • There is no institutional knowledge. Each team/subgroup fixes the issues on the assets it owns in its own way. [Note that this is more of a workflow issue in organizations than a technology issue, but it is amplified more by a centralized management approach than by an event-driven one.]

Event Driven compliance posture management:

Suppose that, instead of a controller directing the edge devices on "what to do", we empower the edge devices themselves to make decisions and self-heal in response to events. That is a whole paradigm shift.

There can be two scenarios in this case:

  • A) The edge devices themselves monitor the events happening in the environment, or

  • B) A messaging system collects events from all devices and feeds them to an automation framework (a decision engine that makes decisions based on context/metadata), which then directs the systems in real time to fix issues such as drift. [Some cloud posture management solutions achieve this by tying events to message queues and then enforcing policies and correcting drift.]
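Scenario B can be sketched as a queue feeding a small decision engine. This is only an illustration under stated assumptions: the event fields (`kind`, `device`, `setting`, `environment`) and the policy of remediating only in production are invented for the example, and an in-process `queue.Queue` stands in for a real messaging system.

```python
import queue
from typing import List, Optional


def decide(event: dict) -> Optional[str]:
    """Hypothetical decision engine: pick an action from the event's metadata."""
    if event.get("kind") != "drift":
        return None  # not a drift event, so there is nothing to enforce
    target = f"{event['setting']} on {event['device']}"
    if event.get("environment") == "production":
        return f"remediate {target}"
    return f"report {target}"


def process(event_queue: "queue.Queue[dict]") -> List[str]:
    """Drain the queue and return the actions the engine decided on."""
    actions: List[str] = []
    while True:
        try:
            event = event_queue.get_nowait()
        except queue.Empty:
            break
        action = decide(event)
        if action is not None:
            actions.append(action)
    return actions


events: "queue.Queue[dict]" = queue.Queue()
events.put({"kind": "drift", "device": "edge-01",
            "setting": "PermitRootLogin", "environment": "production"})
events.put({"kind": "login", "device": "edge-02"})
events.put({"kind": "drift", "device": "lab-07",
            "setting": "PermitRootLogin", "environment": "lab"})

decided = process(events)
print(decided)
```

The same drift produces different actions depending on the metadata attached to the event, which is what distinguishes a decision engine from a fixed push schedule.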

This capability opens up many innovative solutions: by writing 'rules' in the system, we can handle more than configuration drift. Possibilities include:

  • Event-driven patch management. Patches are applied in response to a specific notification. For example, an intranet web page updated with a specific message, or a file uploaded to a central shared folder, can prompt edge nodes to apply patch updates themselves as soon as they notice the event.

  • Ensuring CIS benchmarks are consistently applied on all devices in real time. Essentially, this means the state of the device is maintained at all times, not just checked and fixed periodically.

  • Codifying operational knowledge into institutional knowledge to remediate technical issues in real time.

  • Faster resolution (lower MTTR).

  • Greater observability. One could, for example, write a rule in EDA that fixes a drift in real time but, if the drift occurs more than X times in a day, also creates a Splunk alert, since this could indicate malicious activity on the edge device.
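The observability rule in the last item can be sketched as a counter with a threshold. This is not EDA rulebook syntax and it does not call Splunk; `DriftRule` is a hypothetical class, and the `alerts` list is a placeholder for whatever alerting integration is actually in use.

```python
from collections import Counter
from typing import List, Tuple


class DriftRule:
    """Hypothetical rule: always remediate, and alert on repeated drift."""

    def __init__(self, max_drifts_per_day: int) -> None:
        self.max_drifts_per_day = max_drifts_per_day  # the "X" in the text
        self.counts: Counter = Counter()
        self.alerts: List[Tuple[str, str, int]] = []  # stand-in for a real alert sink

    def handle(self, device: str, day: str) -> str:
        """Record one drift event and return the response taken."""
        self.counts[(device, day)] += 1
        seen = self.counts[(device, day)]
        if seen > self.max_drifts_per_day:
            # Repeated drift may indicate tampering, so raise an alert as well.
            self.alerts.append((device, day, seen))
            return "remediate+alert"
        return "remediate"


rule = DriftRule(max_drifts_per_day=2)
responses = [rule.handle("edge-01", "2024-01-01") for _ in range(3)]
responses.append(rule.handle("edge-01", "2024-01-02"))  # a new day starts a new count

print(responses)
print(rule.alerts)
```

Counting per device and per day keeps a noisy device from hiding behind a quiet fleet, and the drift is still fixed every time regardless of whether an alert is raised.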

On a parting note, I think one of the most essential elements of this approach is that the decision-making logic is baked into the edge devices (for example, a rulebook in the case of Event-Driven Ansible). The rulebook tells a node which events to flag and how to respond to them.
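As a rough illustration of logic that lives on the node as data, the sketch below models a rulebook as a list of condition/action pairs. It is deliberately not the Event-Driven Ansible rulebook format; the `RULEBOOK` structure, the event fields, and the action names are assumptions made up for this example.

```python
from typing import List, Optional

# Hypothetical rulebook: each rule names the event fields to match ("when")
# and the local action to take ("then").
RULEBOOK: List[dict] = [
    {
        "name": "ssh config drift",
        "when": {"type": "file_changed", "path": "/etc/ssh/sshd_config"},
        "then": "restore_sshd_baseline",
    },
    {
        "name": "patch notice",
        "when": {"type": "patch_notice"},
        "then": "apply_patches",
    },
]


def respond(event: dict, rulebook: List[dict]) -> Optional[str]:
    """Return the action of the first rule whose conditions all match the event."""
    for rule in rulebook:
        if all(event.get(key) == value for key, value in rule["when"].items()):
            return rule["then"]
    return None  # no rule matches, so the node ignores the event


print(respond({"type": "file_changed", "path": "/etc/ssh/sshd_config"}, RULEBOOK))
print(respond({"type": "patch_notice", "source": "intranet"}, RULEBOOK))
print(respond({"type": "file_changed", "path": "/etc/motd"}, RULEBOOK))
```

Because the rules are plain data, they can be reviewed, versioned, and shared across teams, which is one way operational knowledge becomes institutional knowledge.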

References:

https://github.com/ansible/event-driven-ansible#readme