CI/CD

Security Testing

A practical guide to embedding security checks into delivery so risks surface early and are handled with clear ownership.

Security testing in DevSecOps is not a phase at the end of development; it is a continuous practice integrated into the delivery pipeline at every stage where problems can be caught cheaply. The shift-left philosophy recognizes that a vulnerability discovered during code review costs almost nothing to fix; the same vulnerability discovered after deployment may cost weeks of remediation, regulatory scrutiny, and loss of customer trust. Effective security testing means choosing the right tools for each stage, making results actionable, and maintaining the discipline to act on findings rather than accumulate them.

Learning objectives

What you should be able to do after reading.
  • Match common security test types to the risks they are best at finding.
  • Understand how quality gates and false positives affect pipeline value.
  • Place security checks where they give useful feedback without slowing delivery unnecessarily.

At a glance

Fast mental model before you dive in.
Static coverage
  • SAST
  • SCA
  • Secret scanning
Runtime and asset checks
  • DAST
  • Container scanning
  • IaC scanning
Decision control
  • Quality gates
  • Triage
  • False positives

Core idea

Different security test types are suited to different phases of the development lifecycle and different categories of risk. Static analysis is best run at commit time because it requires only source code. Dynamic analysis requires a running application and is most useful in staging environments. Dependency scanning can run at any time and should run on every build. Container scanning is most useful when the image is built. Each test type finds different problems and has different false positive rates. Knowing which tool to reach for and when is a skill as important as knowing how to use the tools.

The relationship between security testing and security monitoring is often confused. Security testing looks for vulnerabilities in the software before it reaches users; it is a delivery control. Security monitoring detects attacks against the software after it has been deployed; it is an operational control. Both are necessary, and gaps in security testing increase the burden on monitoring. But monitoring cannot compensate for systematically not testing; it only detects exploitation after the fact.

Proportionality matters. Not every change requires every security test, and adding expensive, slow tests to every commit quickly creates a pipeline that developers learn to ignore or work around. The right approach is to match test intensity to change risk: a documentation change needs no security scanning, while a change to the authentication module warrants thorough static analysis, DAST against the login flow, and SCA for any new dependencies. Progressive security testing, where more thorough checks run as changes progress toward production, balances speed and rigor.
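
As a sketch of this risk-matching idea, the helper below maps changed file paths to a set of security checks. The path patterns, check names, and the `select_checks` helper are illustrative assumptions, not part of any particular CI system.

```python
import fnmatch

# Hypothetical mapping from changed-path patterns to the security checks
# worth running for that kind of change. Patterns and check names are
# illustrative, not tied to any specific tool or pipeline.
RISK_RULES = [
    ("docs/*",      set()),                          # docs: no extra scans
    ("*/auth/*",    {"sast", "dast-login", "sca"}),  # auth code: thorough
    ("Dockerfile",  {"container-scan"}),
    ("terraform/*", {"iac-scan"}),
]
BASELINE = {"secret-scan"}  # always run: cheap and high-signal

def select_checks(changed_paths):
    """Return the set of security checks for a change, scaled to its risk."""
    checks = set(BASELINE)
    for path in changed_paths:
        for pattern, extra in RISK_RULES:
            if fnmatch.fnmatch(path, pattern):
                checks |= extra
    return checks
```

A documentation-only change then triggers just the baseline secret scan, while a change touching an `auth/` directory pulls in the heavier checks.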

Security test results are only useful if someone acts on them. A finding that is detected, triaged, tracked, and left unaddressed for six months is effectively the same as a finding that was never found. The organizational discipline to close the loop, from finding to owner to fix to verification, is as important as the technical capability to find problems in the first place.

Test strategy

  • Apply SAST and secret scanning to every commit or pull request. These are fast and run on static code with no infrastructure required.
  • Apply SCA on every build. Dependency vulnerabilities are disclosed continuously and must be checked against the current state of the codebase.
  • Apply DAST, container scanning, and IaC scanning in staging or pre-production environments, where a full system is available to test.
  • All findings must have a designated owner and a documented response path: 'we found it' without 'someone will fix it by date X' is not a functional security posture.

Baseline

  • Define which finding severities block a release and which require review without blocking, and enforce this consistently rather than making per-release exceptions.
  • Tune tools to the codebase. Disable rules that consistently produce false positives and enable rules relevant to the languages and frameworks in use.
  • Create a formal suppression process. Findings can be suppressed only with a documented justification and an owner, not silently or permanently without review.

Signals to watch for

Patterns worth investigating further.
  • Findings appear without a clear owner or remediation path.
  • Tests are added, but their results do not affect release decisions.
  • High false-positive rates lead teams to ignore useful alerts.

DEEP DIVE

SAST

Static application security testing analyzes source code, bytecode, or compiled artifacts without executing the application. It works by parsing the code into an abstract syntax tree (AST), building a model of data flow through the application, and applying rules (patterns) that detect security-relevant conditions such as user-controlled data flowing into a database query without sanitization (SQL injection), user input reflected into HTML output without encoding (XSS), or hardcoded credentials in string literals.

SAST is well suited to finding structural vulnerabilities that appear as consistent code patterns: SQL injection, command injection, path traversal, deserialization vulnerabilities, use of deprecated cryptographic algorithms, hardcoded secrets, and missing input validation. It struggles with vulnerabilities that only appear in runtime context: authorization failures, business logic errors, race conditions, and configuration-dependent behavior. SAST results require human review to distinguish real vulnerabilities from false positives.
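
To make the pattern-matching idea concrete, here is a toy SAST-style rule built on Python's standard `ast` module. It flags string literals assigned to credential-like names; real tools layer many such rules on top of data-flow analysis, so treat this as a sketch of the mechanism, not a usable scanner.

```python
import ast

# Illustrative list of credential-like variable names.
SUSPECT_NAMES = {"password", "secret", "api_key", "token"}

def find_hardcoded_credentials(source: str):
    """Toy SAST rule: flag string literals assigned to credential-like names.

    Parses the source into an AST and walks simple assignments; returns
    (line number, variable name) pairs for suspicious matches.
    """
    findings = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
            if isinstance(node.value.value, str):
                for target in node.targets:
                    if isinstance(target, ast.Name) and target.id.lower() in SUSPECT_NAMES:
                        findings.append((node.lineno, target.id))
    return findings
```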

Popular SAST tools include Semgrep (fast, rule-based, highly customizable with community rule sets for dozens of frameworks), SonarQube (broad language support, integrates with many CI systems, tracks findings over time), and CodeQL (GitHub's query-based analysis that models the application as a database and supports complex cross-function vulnerability queries). Each has different rule depth, performance characteristics, and integration complexity.

The most common mistake with SAST is running it with default rules and accepting all findings uncritically. Default rule sets are designed to be broad rather than specific to any given codebase, which typically produces high false positive rates. Effective SAST requires tuning: disabling rules that consistently fire on non-issues in the codebase, enabling rules for the specific frameworks and libraries in use, and establishing a baseline of acceptable findings before enforcing gates.

DAST

Dynamic application security testing exercises a running application from the outside, treating it as a black box. The tester (automated tool or human) sends requests, observes responses, and looks for behaviors that indicate vulnerabilities: SQL error messages in responses (suggesting SQL injection), reflected input in HTML (suggesting XSS), unexpected redirects, verbose error messages, and missing security headers. DAST can find vulnerabilities that only manifest at runtime, including injection flaws that involve multiple application components, authentication and session management issues, and server-side request forgery.

DAST tools typically work in two modes: passive scanning (observing traffic without sending malicious payloads, to identify configuration and header issues) and active scanning (sending crafted attack payloads to probe for vulnerabilities). OWASP ZAP is a popular open-source tool that supports both modes and can be run headlessly in CI pipelines. Burp Suite is the industry standard for manual and semi-automated testing by security professionals, offering deep inspection and custom attack scripting.
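
A minimal sketch of a passive-style check, assuming the response headers are already available as a dict: it reports which commonly expected security headers are missing, without sending any attack payloads. The header list is illustrative; real scanners check a broader, configurable set.

```python
# Security headers a passive scan commonly checks for; the exact list
# varies by tool and policy. Names follow common browser conventions.
EXPECTED_HEADERS = {
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Content-Type-Options",
    "X-Frame-Options",
}

def missing_security_headers(response_headers: dict) -> set:
    """Return the expected security headers absent from a response.

    A passive check only observes responses; it never sends attack payloads.
    Header names are compared case-insensitively.
    """
    present = {name.title() for name in response_headers}
    return EXPECTED_HEADERS - present
```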

The quality of DAST results depends heavily on the quality of the test environment. A staging environment that uses a different database, has different configuration, or lacks the same session handling as production will produce results that don't accurately reflect production behavior. Authenticated DAST, where the tool logs in with a test user account and explores authenticated functionality, finds significantly more vulnerabilities than unauthenticated scanning but requires more setup and maintenance as the application's authentication mechanism changes.

A common misconception is that DAST 'tests the whole application' and therefore provides comprehensive security coverage. In practice, DAST can only test the functionality it can discover and reach; it misses unexposed API endpoints, functionality behind complex authentication flows, business logic not accessible through standard UI paths, and code paths that require specific user state. DAST complements SAST and manual testing but does not replace them.

SCA

Software composition analysis inventories the third-party libraries and components in a codebase and checks them against databases of known vulnerabilities. Modern applications commonly include hundreds of direct dependencies and thousands of transitive ones: libraries that the dependencies depend on, which the application's own code never directly calls. SCA examines the full dependency graph, including transitive dependencies, and identifies components with known CVEs, outdated versions, risky licenses, and abandoned packages.

CVSS (Common Vulnerability Scoring System) scores are the standard metric for vulnerability severity, ranging from 0.0 to 10.0. A critical CVSS score (9.0-10.0) indicates a severe vulnerability, but CVSS alone does not determine exploitability in a specific context. A critical CVE in a library component that is never called by the application, or that requires local access to exploit, is less urgent than its score suggests. Reachability analysis, determining whether the vulnerable code path is actually called by the application, is available in some SCA tools (Snyk, GitHub Advanced Security) and significantly reduces false-positive-like findings.
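
The interplay of CVSS score, reachability, and attack vector can be sketched as a small triage function. The thresholds and priority labels here are hypothetical; real programs tune them to their own risk model.

```python
def triage_priority(cvss: float, reachable: bool, network_exploitable: bool) -> str:
    """Rough triage sketch: CVSS sets a ceiling, context adjusts urgency.

    Thresholds and labels are illustrative, not a standard.
    """
    if cvss >= 9.0 and reachable and network_exploitable:
        return "fix-now"
    if cvss >= 7.0 and reachable:
        return "fix-this-sprint"
    if cvss >= 7.0:
        return "review"  # severe on paper, but the vulnerable code path is unused
    return "backlog"
```

Note how a 9.8 CVE in an unreachable code path drops from "fix-now" to "review": the score alone does not decide urgency.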

Popular SCA tools include Dependabot (GitHub's built-in dependency update service), Snyk (broad language support, reachability analysis, actionable remediation suggestions), and OWASP Dependency-Check (open-source, integrates with many CI systems). Each tool has different coverage of language ecosystems and different approaches to presenting findings. License compliance scanning is often a secondary function of SCA tools, identifying dependencies with licenses that conflict with the organization's distribution model.

The most common tension in SCA is between the security benefit of updating dependencies and the operational risk of breaking changes. A critical CVE in a major framework dependency requires immediate attention, but updating that dependency may break the application in ways that require significant testing effort. The practice of keeping dependencies current through automated, incremental updates (Dependabot PRs merged regularly) is much safer than deferring all updates until a critical CVE forces an emergency upgrade.

Secret scanning

Secret scanning detects credentials, tokens, API keys, private keys, and other sensitive values that have been committed to source code repositories or appear in build artifacts. Secrets committed to a repository, especially a public one, must be assumed to be compromised immediately, because they may have been indexed by search engines, cached by GitHub, or observed by bots that scan public repositories for credentials within seconds of publication.

Detection techniques include pattern matching against known secret formats (GitHub tokens match a specific regex, AWS access keys have a recognizable structure, private keys begin with a PEM header), entropy analysis (high-entropy strings are likely random and thus possibly cryptographic material), and machine learning models trained on large datasets of known secrets. GitHub Advanced Security, GitGuardian, TruffleHog, and gitleaks are popular tools that combine multiple detection techniques.
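
The two simplest detection techniques, format patterns and entropy analysis, can be sketched in a few lines. The regex reflects the classic `AKIA…` AWS access key ID format mentioned above; the entropy threshold of 4.0 bits per character is an illustrative choice, not a standard.

```python
import math
import re

# Example pattern: classic AWS access key IDs start with "AKIA" followed by
# 16 uppercase alphanumerics. Real scanners ship hundreds of such patterns.
AWS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character; random tokens score near log2(charset size)."""
    if not s:
        return 0.0
    freq = {c: s.count(c) / len(s) for c in set(s)}
    return -sum(p * math.log2(p) for p in freq.values())

def looks_like_secret(token: str, threshold: float = 4.0) -> bool:
    """Flag a token if it matches a known secret format or is high-entropy."""
    return bool(AWS_KEY_RE.search(token)) or shannon_entropy(token) > threshold
```

Entropy alone produces many false positives (hashes, UUIDs, minified code), which is why production tools combine it with format patterns and contextual signals.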

Pre-commit hooks can scan staged files before a commit is made, providing the earliest possible detection. However, pre-commit hooks run on the developer's local machine and can be bypassed intentionally or accidentally disabled. CI-based scanning is more reliable as a control because it is harder to bypass: it runs on every push to the repository regardless of local configuration. Historical scanning searches the full git history for secrets that were committed and then deleted (deletion does not remove a secret from history), which is important for auditing repositories that may have had secrets committed in the past.

When a secret is detected in a repository, at any point in its history, the correct response is to immediately rotate the credential (generate a new one and revoke the exposed one), audit logs to determine whether the credential was used by anyone other than the intended user, and investigate how it was committed (was it a mistake, a test credential, or a production credential?). Removing the secret from git history (using git-filter-repo) reduces future exposure but does not undo any access that already occurred with the exposed credential.

Container scanning

Container scanning inspects container images for known vulnerabilities in the packages and libraries they contain. It operates at two levels: OS-level scanning, which examines the packages installed by the distribution's package manager (apt, rpm, apk), and application-level scanning, which examines language-specific packages (npm, pip, maven, gem). A comprehensive scan covers both levels, because an OS package with a known CVE is just as exploitable as a vulnerable npm package in many attack scenarios.

The base image is typically the largest source of vulnerabilities in a container image. An image based on ubuntu:20.04 that has not been rebuilt in three months may have dozens of known CVEs in its OS packages, even though the application itself has no new code changes. This is why regularly rebuilding images from updated base images, not just when application code changes, is an important part of container security hygiene. Pinning to a specific base image digest and having the pipeline automatically test when the base image is updated is the production-grade approach.

Trivy is a widely used open-source container scanner that checks OS packages, language dependencies, and configuration issues in a single tool. Commercial alternatives include Snyk Container, Anchore, and registry-integrated scanners in AWS ECR, Google Artifact Registry, and GitHub Container Registry. The choice of tool affects coverage, update frequency of vulnerability databases, CI integration quality, and the ability to write custom policies.

A critical distinction is between 'the image has CVEs' and 'the image has exploitable vulnerabilities in context.' A container image may have dozens of CVEs in rarely-used OS packages that are never reachable from any external input. Vulnerability severity in the scanner output is a starting point for triage, not a final verdict. Teams that apply scanner output directly to release gates without contextual triage often find themselves in one of two failure modes: blocking too much (high false positive rates cause developers to distrust and work around the scanner) or allowing too much (the severity threshold is set so high that real risks pass through).

IaC scanning

Infrastructure-as-code scanning analyzes cloud and platform configuration files (Terraform, CloudFormation, Kubernetes YAML, Bicep, Pulumi) for security misconfigurations before they are applied to real infrastructure. This is the IaC equivalent of SAST: finding problems in declarative definitions while they are still cheap to fix, rather than after the misconfigured resource has been deployed and potentially exposed data or become an entry point.

IaC scanners look for a well-known set of high-risk patterns: S3 buckets with public access enabled, security groups with ingress open to 0.0.0.0/0, databases without encryption at rest, IAM policies with wildcard permissions (Action: '*'), missing MFA enforcement, storage volumes without backups, and Kubernetes Pods with privileged: true or broad hostPath mounts. These are the same patterns that cloud security posture management (CSPM) tools detect in live environments, but IaC scanning catches them before deployment.
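
A toy version of such checks, operating on already-parsed resource definitions. The simplified resource shapes here are assumptions for illustration, not a real Terraform or CloudFormation schema.

```python
def check_resource(resource: dict) -> list:
    """Toy IaC checks over a parsed resource definition.

    Returns a list of human-readable findings for a few of the
    high-risk patterns described above.
    """
    findings = []
    kind = resource.get("type")
    cfg = resource.get("config", {})
    if kind == "s3_bucket" and cfg.get("public_access"):
        findings.append("S3 bucket allows public access")
    if kind == "security_group":
        for rule in cfg.get("ingress", []):
            if rule.get("cidr") == "0.0.0.0/0":
                findings.append(f"Ingress open to the internet on port {rule.get('port')}")
    if kind == "iam_policy" and "*" in cfg.get("actions", []):
        findings.append("IAM policy grants wildcard actions")
    return findings
```

Real scanners such as Checkov express checks like these as named, versioned policies and evaluate them against the full resource graph rather than one resource at a time.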

Popular IaC scanning tools include Checkov (broad IaC format support, 1000+ built-in checks, custom policy support in Python), tfsec (Terraform-focused, fast, good CI integration), KICS (Keeping Infrastructure as Code Secure, open source, broad format support), and Terrascan. The right tool depends on the IaC framework in use, the CI system, and whether custom policies are needed for organization-specific security standards.

The key distinction between IaC scanning and cloud security posture management (CSPM) is timing. IaC scanning operates on code before deployment; it is a preventive control. CSPM tools (AWS Security Hub, Prisma Cloud, Wiz) scan live cloud resources after deployment; they are detective controls. Both are valuable: IaC scanning prevents misconfigurations from becoming live resources, while CSPM detects misconfigurations that bypassed IaC scanning, were applied manually, or drifted from the declared state after initial deployment.

Quality gates

A quality gate is a policy decision point in the pipeline that evaluates security test results and determines whether the build should proceed. A finding that matches the gate criteria blocks the pipeline; one that does not match passes through (possibly with a warning). Quality gates make security testing actionable: instead of producing a report that may or may not be read, the pipeline stops until the finding is addressed.

Gate design requires balancing strictness with practicality. Blocking on every critical CVE, even for a library component in a deeply nested transitive dependency with a theoretical exploit, will quickly produce a pipeline that cannot deploy. A pragmatic approach considers the combination of severity (CVSS score), exploitability (reachability, attack vector, availability of exploits), and business impact (what service is affected, what data could be exposed). Gates should also consider exceptions: a known-vulnerable component with an accepted risk and a tracked remediation timeline should not block every build.

Time-based exception management is an important feature for operational SCA gates. When a new vulnerability is disclosed, organizations reasonably need time to assess impact, test patches, and schedule remediation. A gate that provides a configurable grace period ('block if this CVE is more than 30 days old and unaddressed') allows teams to work at a reasonable pace while still ensuring that vulnerabilities do not accumulate indefinitely. The grace period should be shorter for critical issues and longer for lower-severity ones.
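
A grace-period gate of this kind can be sketched as follows. The per-severity windows and the `accepted_until` exception field are illustrative assumptions, not values from any specific tool.

```python
from datetime import date, timedelta
from typing import Optional

# Illustrative grace periods per severity: critical issues get less time.
GRACE_DAYS = {"critical": 7, "high": 30, "medium": 90, "low": 180}

def gate_blocks(severity: str, disclosed: date, today: date,
                accepted_until: Optional[date] = None) -> bool:
    """Block the build once a finding's grace period has lapsed,
    unless a tracked risk acceptance is still in effect."""
    if accepted_until is not None and today <= accepted_until:
        return False  # documented exception with a review date
    deadline = disclosed + timedelta(days=GRACE_DAYS.get(severity, 0))
    return today > deadline
```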

Alert fatigue is the enemy of functional quality gates. When gates fire constantly with low-signal findings, developers learn to ignore them, suppress them without review, or apply pressure to relax the policy. The right response to excessive gate failures is not to lower the threshold but to investigate why the gate is firing. Are the rules well-tuned for the codebase? Are findings being addressed? Is the right tool being applied to the right stage? A gate that rarely fires but fires meaningfully is more valuable than one that fires constantly and is routinely bypassed.

False positives

A false positive in security testing is a finding that the tool reports as a vulnerability but which does not represent a real security risk in the specific context of the application. Common causes include: SAST tools flagging a data flow as dangerous without considering that the input is validated upstream, SCA tools flagging a CVE without knowing that the vulnerable code path is never called, and DAST tools flagging a reflected value as XSS without verifying that the context is already safely encoded.

False positives have a direct cost: each one requires time to investigate, triage, and suppress. In high-volume environments where a scanner produces hundreds of findings per build, the aggregate cost of triaging false positives can exceed the cost of the real vulnerabilities the scanner finds. The result is one of two failure modes: teams spend most of their security review time on findings that turn out to be non-issues (wasted effort), or teams stop reviewing findings altogether (security blindness).

The right response to a false positive is to suppress it explicitly with documented justification, not to silence the tool or disable the rule. A suppression should record what the finding is, why it is not a real vulnerability in this context, who made that determination, and when the determination should be reviewed again (in case the context changes). Some tools support suppression via code comments (Semgrep's nosemgrep directive, SonarQube's NOSONAR comment) or via configuration-file suppressions that apply to specific file patterns.

The distinction between a false positive (the tool is wrong about the risk) and an accepted risk (the tool is right about the risk, but the team has decided not to fix it for documented reasons) matters for audit and compliance. A false positive suppression says 'this is not a vulnerability.' An accepted risk says 'this is a real vulnerability that we have assessed and accepted under specific conditions, with a documented owner and review date.' Conflating the two creates false confidence: an accepted risk that is suppressed as a false positive may never be revisited even when the conditions that justified acceptance have changed.
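
One way to keep the two dispositions distinct is to record them explicitly, as in this minimal sketch (the field names and the `Disposition` record are illustrative):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Disposition:
    """Record of why a finding does not block: either the tool was wrong
    (false positive) or the risk is real but accepted under conditions."""
    finding_id: str
    kind: str          # "false_positive" or "accepted_risk"
    justification: str
    owner: str
    review_by: date    # when the determination must be revisited

    def is_due_for_review(self, today: date) -> bool:
        return today >= self.review_by
```

Because every record carries an owner and a review date, an accepted risk cannot silently become permanent: it resurfaces when the review date passes.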