
Runtime Security

How to observe active workloads, detect suspicious behaviour, and respond before small issues become incidents.

Runtime security focuses on detecting and responding to threats while workloads are actively running in production. Unlike static analysis or pre-deployment controls, runtime security provides a last line of defence: catching attacks that slipped through earlier controls and containing them before they cause significant harm.

Learning objectives

What you should be able to do after reading.
  • Explain what runtime monitoring captures and why it matters.
  • Distinguish runtime events, audit logs, and health checks.
  • Describe a practical path from alert triage to incident response.

At a glance

Fast mental model before you dive in.
Detection layer
  • Runtime monitoring
  • Falco
  • Runtime events
Evidence layer
  • Audit logs
  • Health checks
  • System context
Response layer
  • Alert triage
  • Containment
  • Incident response

Runtime Threat Detection

Runtime threat detection monitors the behaviour of running workloads for signs of compromise: unexpected process spawns, unusual system calls, network connections to new destinations, file writes in sensitive directories, and privilege escalation attempts. Unlike static analysis or pre-deployment controls, runtime detection operates on what workloads actually do, not on what they are configured to do.

Modern runtime detection tools use eBPF (extended Berkeley Packet Filter), a Linux kernel technology that safely runs monitoring code inside the kernel, observing every system call with minimal overhead. Falco, Tetragon, and similar tools are built on eBPF and can instrument all container activity on a node without modifying the containers themselves.

Alert fatigue is the primary operational challenge in runtime threat detection. A busy container generates millions of system calls per minute; only a tiny fraction are security-relevant. Effective detection requires investing in rule tuning, suppressing known-safe baseline patterns, and focusing alerts on high-confidence signals that represent real threats rather than normal operation.

Container Runtime Security

Containers run on a shared kernel, making kernel-level attack surface a genuine concern. Security profiles such as seccomp (system call filtering), AppArmor, and SELinux restrict what a container process can do at the kernel level. A container that only needs to serve HTTP traffic should not be able to call mount() or create raw sockets; seccomp profiles enforce this by blocking system calls the application does not legitimately need.

Additional runtime hardening includes running containers as non-root users, setting the root filesystem as read-only, dropping all Linux capabilities beyond the minimum required, disabling privilege escalation (no-new-privileges), and avoiding host namespace sharing. These settings are specified in the Kubernetes pod security context and in the Dockerfile USER instruction.
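Taken together, these settings can be sketched in a Pod manifest. The names, image, and UID below are illustrative assumptions, not prescriptions:

```yaml
# Sketch of a hardened Pod spec (illustrative names and values).
apiVersion: v1
kind: Pod
metadata:
  name: api-server
spec:
  securityContext:
    runAsNonRoot: true          # refuse to start if the image would run as root
    runAsUser: 10001            # arbitrary unprivileged UID
    seccompProfile:
      type: RuntimeDefault      # block syscalls outside the runtime's default set
  containers:
    - name: app
      image: registry.example.com/api:1.4.2
      securityContext:
        allowPrivilegeEscalation: false   # no-new-privileges
        readOnlyRootFilesystem: true      # immutable root filesystem at runtime
        capabilities:
          drop: ["ALL"]         # add back only what the app demonstrably needs
```

Writable scratch space, if the application needs it, would be mounted as an explicit emptyDir volume rather than relaxing readOnlyRootFilesystem.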

Container breakout detection is an important component of runtime security. A breakout occurs when a process inside a container escapes the namespace boundary and accesses the host system. Detection signals include processes accessing host filesystem paths outside their expected mounts, unusual capabilities being acquired, and unexpected interaction with the container runtime socket (/var/run/docker.sock or /run/containerd/containerd.sock).
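One of those detection signals, access to the container runtime socket, might be expressed as a custom Falco-style rule. This is a simplified sketch that assumes Falco's default macros (open_read, container) are loaded; the shipped default ruleset contains comparable detections:

```yaml
# Sketch of a custom Falco rule (assumes default macros are available).
- rule: Container Runtime Socket Accessed
  desc: A process inside a container opened the container runtime socket
  condition: >
    open_read and container and
    (fd.name = /var/run/docker.sock or fd.name = /run/containerd/containerd.sock)
  output: >
    Runtime socket opened in container
    (command=%proc.cmdline container=%container.name
     image=%container.image.repository)
  priority: CRITICAL
```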

Immutable Infrastructure

Immutable infrastructure is the practice of deploying containers and VMs that are never modified after deployment. When a change is needed, a new image is built, tested, and deployed to replace the old one. The running instance is never patched in place, never logged into for debugging, and never updated with live config changes. This approach eliminates configuration drift and makes unauthorized changes detectable.

File integrity monitoring (FIM) complements immutable infrastructure by detecting when files in the container or VM are modified at runtime. If a container's filesystem changes after deployment in ways that were not expected (new executables, modified libraries, new configuration files), this is a strong signal of compromise. Runtime security tools can alert on unexpected filesystem modifications in sensitive directories.

The practical implication of immutable infrastructure for incident response is significant: if a container is suspected to be compromised, the correct response is to preserve it for forensics (capture a filesystem snapshot, logs, and network connections) and then terminate and replace it with a clean instance. Attempting to patch or clean a potentially compromised running container is risky and unreliable.

Security Observability

Comprehensive observability is a prerequisite for effective runtime security. Logs, metrics, and traces must be collected from every workload, centralized in a queryable system, and retained for sufficient time to support incident investigation. In Kubernetes environments, the Kubernetes audit log (recording all API server requests) is a critical additional data source alongside application logs and runtime security events.

Security observability requires selecting the right log sources and ensuring they are protected. Application logs reveal what the application itself is doing. Kubernetes audit logs reveal control-plane activity (who created which pods, who modified which RBAC policies). Runtime security logs (from Falco or similar) reveal kernel-level events. Network flow logs reveal communication patterns. Together these layers provide the full picture of an attack.

Shipping security logs to an external, immutable destination is essential for their trustworthiness. An attacker with sufficient cluster access can modify in-cluster logs. Exporting logs to an S3 bucket with Object Lock, a managed SIEM, or a cloud-native log service before any tampering can occur ensures the logs remain reliable evidence for forensic investigation and compliance reporting.

Incident Response in Cloud-Native Environments

Responding to an incident in a Kubernetes environment requires different approaches than traditional server forensics. Containers are ephemeral: by the time an alert fires, the container may already have been replaced by a new healthy instance, destroying the evidence of the compromise. Incident response processes must account for this by collecting evidence immediately and automatically when a suspicious event is detected.

Containment in cloud-native incident response uses the platform's own controls. A suspected compromised pod can be isolated by applying a NetworkPolicy that blocks all ingress and egress, by cordoning the node it runs on, or by adding a label that removes it from the service's load balancer endpoint set. These actions can be automated in response to high-confidence Falco alerts.

Post-incident review in cloud-native environments focuses on systemic improvement rather than individual blame. The key questions are: how did the attacker gain initial access, what controls failed or were absent, what controls succeeded in limiting the blast radius, and what changes to detection rules, security policies, or infrastructure configuration would prevent recurrence. These findings feed back into the security programme.

Signals to watch for

Patterns worth investigating further.
  • A workload suddenly spawns processes, opens network connections, or writes files outside its normal pattern.
  • Security alerts are noisy because no one has defined what is expected versus suspicious.
  • There is no clear playbook for isolating a container, node, or service during an incident.

DEEP DIVE

Runtime Monitoring

Runtime monitoring is the practice of observing what workloads actually do while running in production, rather than only analyzing what they are expected to do from their source code or configuration. The goal is to detect behaviour that deviates from an established baseline: unusual system calls, unexpected network connections, file modifications in protected directories, and privilege escalation attempts. Runtime monitoring catches attacks that bypass all pre-deployment controls.

Modern runtime monitoring uses eBPF (extended Berkeley Packet Filter), a Linux kernel technology that allows programs to run safely inside the kernel and observe system calls, network events, and filesystem operations with minimal overhead. eBPF-based tools instrument the kernel directly, meaning they observe all container activity on a node without modifying the containers themselves, without injecting agents into them, and without requiring container image changes.

Effective runtime monitoring requires an established behavioural baseline. What system calls does this service normally make? Which files does it read or write? Which network connections does it initiate? Deviations from this baseline trigger alerts. Some tools learn baselines automatically by observing the application during a training period, others use manually defined allow rules that capture expected behaviour explicitly.

Runtime monitoring is the last line of defence against post-deployment attacks. Even a container that passed every security scan, was built from a minimal base image, and was deployed with a restrictive security context can still be compromised if a zero-day vulnerability is discovered after deployment. Runtime monitoring detects the post-exploitation behaviour: reconnaissance commands, outbound connections to new destinations, and lateral movement attempts, all of which appear after initial access is achieved.

Falco

Falco is an open-source cloud-native runtime security tool created by Sysdig and now a CNCF graduated project. It uses a rules engine to detect anomalous behaviour in running containers, Kubernetes pods, and Linux hosts. Falco ships with an extensive library of default rules covering the most common attack patterns and can be extended with custom rules written in a YAML-based syntax.

Falco works by consuming kernel events through eBPF or a kernel module, enriching them with Kubernetes metadata (pod name, namespace, container image, labels), and matching them against its rule set. When a rule matches, Falco generates a structured alert that includes all the relevant context: which container, which process, which system call, which user, and the full command line that triggered the event. This context is essential for rapid triage.

A canonical Falco rule detects a shell spawned inside a container. In a production environment, containers should never run interactive shells; if a shell process appears inside a running container, it almost always indicates either an attacker who has achieved code execution and is running commands, or a developer who has exec'd into a container for debugging (which itself violates the immutable infrastructure principle). Either event warrants investigation.
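A simplified sketch of such a rule in Falco's YAML syntax follows. It assumes Falco's default macros (spawned_process, container) are loaded; the default rule Falco actually ships is more elaborate (it also checks, for example, that the shell has an attached terminal):

```yaml
# Simplified sketch of a shell-in-container rule (assumes default macros).
- rule: Shell Spawned in Container
  desc: An interactive shell process was started inside a container
  condition: >
    spawned_process and container and
    proc.name in (bash, sh, zsh, dash, ash)
  output: >
    Shell spawned in container
    (user=%user.name shell=%proc.name parent=%proc.pname
     cmdline=%proc.cmdline container=%container.name
     image=%container.image.repository)
  priority: WARNING
```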

Deploying Falco effectively requires tuning. The default ruleset is comprehensive and will generate noise in environments that do certain things Falco treats as suspicious by default (running privileged containers, using the host network namespace, certain system administration tasks). The process of tuning is to disable rules that generate acceptable and understood exceptions in your environment, and to add custom rules for application-specific behaviours your team wants to detect. A well-tuned Falco deployment surfaces high-confidence signals with low false-positive rates.

Runtime Events

Runtime events are the individual observations recorded by a runtime security system: a system call made by a process, a file opened or written, a network connection established, a new process spawned, a user created, a privilege escalated. Each event is a data point; the security value comes from correlating events over time and recognising patterns that indicate malicious activity rather than legitimate operation.

High-signal events demand immediate attention because they have few legitimate explanations in a production environment. These include a process writing to /etc/passwd (credential modification), a container spawning a shell, a service making DNS queries to newly seen domains, a process attempting to load a kernel module, a container mounting the host filesystem or the container runtime socket, and any process running with capabilities beyond those declared in the pod specification.
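The credential-modification signal above could be captured with a custom rule along these lines. Again a hedged sketch in Falco's syntax, assuming the default open_write and container macros:

```yaml
# Sketch of a credential-file detection rule (assumes default macros).
- rule: Credential File Modified
  desc: A containerised process opened /etc/passwd or /etc/shadow for writing
  condition: >
    open_write and container and
    fd.name in (/etc/passwd, /etc/shadow)
  output: >
    Credential file opened for writing
    (file=%fd.name command=%proc.cmdline user=%user.name
     container=%container.name)
  priority: CRITICAL
```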

Event volume is a fundamental challenge in runtime security. A busy container may generate millions of system calls per second, but only a tiny fraction are security-relevant. Runtime security tools apply aggressive filtering, discarding events that match safe baseline patterns and only forwarding anomalous events for alerting. Getting this balance right (low false positives, low false negatives) is the ongoing operational challenge of a mature runtime security programme.

Events from runtime security tools should be integrated with the rest of the security data pipeline. Correlating a runtime alert (a shell spawned in a container at 03:00) with a network alert (that container made an outbound connection to an unknown IP at 03:01) and an authentication log (the container's service account read five Kubernetes secrets at 03:02) creates a narrative of the attack that no single event source can provide. This correlation is the core function of a SIEM in a cloud-native environment.

Audit Logs

Kubernetes audit logs record every API request made to the Kubernetes control plane. Who made the request (user, service account, or system component), what resource was accessed (pod, secret, configmap, role binding), what operation was performed (create, get, update, delete), and what the outcome was (allowed or denied). These logs are distinct from application logs and runtime event logs: they describe control-plane activity.

Kubernetes audit policy defines which events are logged and at what detail level. The None level discards an event entirely. Metadata logs the request metadata but not the request or response body. Request logs the request body. RequestResponse logs both request and response bodies. Security-relevant events (modifications to RBAC policies, access to secrets, pod creation, changes to network policies) should be logged at the Request or RequestResponse level to capture the full content of what changed.
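Applying those levels, a minimal audit policy might be sketched as below. The rule selection is illustrative, not exhaustive; rules are evaluated in order and the first match wins:

```yaml
# Sketch of a minimal Kubernetes audit policy (illustrative rule set).
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Full request and response bodies for secret access and RBAC changes
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["secrets"]
      - group: "rbac.authorization.k8s.io"
        resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
  # Request bodies for pod and network policy modifications
  - level: Request
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: ""
        resources: ["pods"]
      - group: "networking.k8s.io"
        resources: ["networkpolicies"]
  # Discard high-volume health check noise entirely
  - level: None
    nonResourceURLs: ["/healthz*", "/readyz*", "/livez*"]
  # Everything else: metadata only
  - level: Metadata
```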

Audit logs are critical for detecting compromised service accounts and insider threats. If a service account that normally only reads from a specific ConfigMap suddenly starts listing all secrets across all namespaces, creating new pods with elevated privileges, or modifying ClusterRoleBindings, the audit log captures all of this activity in detail. Without audit logging enabled, these control-plane activities are completely invisible to the security team.

Audit logs must be shipped outside the cluster to be trustworthy. An attacker with cluster-admin access could modify or delete in-cluster log storage. Shipping logs to an external, append-only destination (an S3 bucket with Object Lock, a managed log service, or a SIEM) before they can be tampered with ensures they remain reliable evidence. Audit log retention periods should align with incident response and compliance requirements, typically 90 days to one year.

Health Checks

Health checks are application endpoints that report whether a workload is functioning correctly. Kubernetes uses liveness probes (which determine whether a container should be restarted), readiness probes (which determine whether a container should receive traffic), and startup probes (which give slow-starting applications extra time before liveness checks begin). From a security perspective, health checks serve additional purposes beyond availability. They can remove compromised or misbehaving containers from production traffic and trigger restart cycles that replace potentially modified containers.
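The three probe types can be sketched in a container spec as follows. The endpoint paths, port, and thresholds are illustrative assumptions:

```yaml
# Sketch of the three probe types on one container (illustrative values).
containers:
  - name: app
    image: registry.example.com/api:1.4.2
    startupProbe:              # allow up to 30 x 10s = 5 min before liveness applies
      httpGet: { path: /healthz, port: 8080 }
      failureThreshold: 30
      periodSeconds: 10
    livenessProbe:             # restart the container if it stops responding
      httpGet: { path: /healthz, port: 8080 }
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:            # stop routing traffic while checks fail
      httpGet: { path: /readyz, port: 8080 }
      periodSeconds: 5
      failureThreshold: 2
```

Separating the liveness endpoint (is the process alive?) from the readiness endpoint (can it serve traffic, are dependencies reachable?) avoids restart loops caused by a temporarily unavailable downstream dependency.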

Liveness probes cause Kubernetes to restart containers that are stuck or corrupted. If an attacker's exploit code causes the health check endpoint to stop responding, Kubernetes restarts the container, which in an immutable infrastructure model means starting fresh from the original image. This provides a weak but real resilience benefit: some attacks that corrupt a container's state will inadvertently cause the liveness probe to fail, triggering a clean restart.

Readiness probes control traffic routing at the service level. A container that fails its readiness probe is removed from the Service's endpoint set and stops receiving new requests. This matters for security because a compromised container that becomes unresponsive or starts behaving erratically will be isolated from production traffic, limiting the blast radius while investigation and remediation proceed. Existing in-flight requests may be affected, but new requests route to healthy instances.

A common mistake is implementing health check endpoints that always return HTTP 200 without inspecting any application state. A trivial liveness probe that returns success regardless of actual application health provides no useful signal. It will not catch stuck states, corrupted configurations, or missing dependencies. Meaningful health checks verify that critical dependencies are accessible and that the application's core functionality responds within expected parameters.

Alert Triage

Alert triage is the process of evaluating incoming security alerts, determining their priority, and deciding on the appropriate response. In a runtime security context, this means examining Falco events, anomaly detection findings, and audit log events to distinguish genuine threats from false positives, and escalating high-confidence threats for immediate incident response while suppressing low-confidence or known-safe patterns.

Effective triage requires context. An alert that a shell was spawned inside a container means different things depending on which container it is (a production API server versus an administrative toolbox), what time it happened, which user or service account triggered it, and whether correlated alerts preceded it (was there an unusual network connection before the shell appeared?). Security teams that invest in alert enrichment and correlation dramatically reduce mean triage time and improve decision quality.

Alert fatigue is the primary enemy of effective security operations. When alert volumes are high and false-positive rates are high, analysts learn to ignore alerts. The solution is systematic tuning. Suppress known-safe patterns in detection rules, increase alert quality rather than quantity, and focus engineering effort on the highest-confidence signals. A smaller number of high-confidence alerts that are always investigated is more valuable than a large volume of alerts that are routinely ignored.

Triage runbooks document the expected investigation steps for each alert type. For a Falco alert 'shell spawned in container,' the runbook might specify: identify the pod and namespace; check Kubernetes audit logs for recent API activity by the pod's service account; review network flow logs for outbound connections from the pod; capture a filesystem snapshot of the container if it is still running; and escalate to incident response if the shell cannot be explained by a known-safe maintenance activity. Runbooks reduce cognitive load during high-stress events and ensure consistent handling.

Incident Response Basics

Incident response (IR) in cloud-native environments differs from traditional server IR in several fundamental ways. Containers are ephemeral: the evidence of a compromise may be gone before an investigation begins if the container has already been restarted or replaced. The infrastructure is programmable: an attacker with sufficient access can cover tracks by deleting logs, removing pods, or reverting configuration changes. IR processes must be designed for this reality.

Evidence preservation is the first priority when a compromise is detected. Because containers can terminate at any time, evidence collection must be automated and triggered immediately when a suspicious event is detected. This means capturing the container's filesystem using a snapshot or forensic copy, exporting all available logs (application logs, runtime security events, Kubernetes audit logs) before they roll off, preserving network traffic if a packet capture was in progress, and documenting the Kubernetes resource state (pod spec, service account bindings, active network policies).

Containment in cloud-native IR uses the platform's controls rather than network firewall rules. A suspected compromised pod can be isolated by applying a NetworkPolicy that blocks all ingress and egress (allowing the pod to continue running for forensics while preventing further harm), by cordoning the node to prevent new workloads from scheduling there, or by removing the pod from the service's load balancer endpoints so it receives no new traffic. These actions can be automated in response to high-confidence alerts.
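The quarantine-by-NetworkPolicy step might be sketched as follows. The namespace and label are illustrative, and enforcement assumes a CNI plugin that implements NetworkPolicy:

```yaml
# Sketch: quarantine any pod carrying the label quarantine=true
# (illustrative names; requires a NetworkPolicy-capable CNI).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine
  namespace: payments
spec:
  podSelector:
    matchLabels:
      quarantine: "true"       # label applied to the suspect pod during triage
  policyTypes: ["Ingress", "Egress"]
  # No ingress or egress rules listed: all traffic in both
  # directions is denied while the pod stays up for forensics.
```

Responders then only need to label the suspect pod to isolate it, leaving the process state and filesystem intact for evidence collection.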

Post-incident review drives systematic improvement. After each incident, the team should answer four questions: how did the attacker achieve initial access, what controls were missing or misconfigured, what controls succeeded in limiting damage, and what changes to detection rules, runtime policies, or infrastructure configuration would prevent recurrence. The answers become action items in the security backlog. This improvement loop is what distinguishes mature security programmes from those that respond to the same types of incidents repeatedly.