Security

Chaos Auditor

Identifies fragile code paths — unhandled errors, env vars without defaults, untested functions, and single points of failure — and outputs an HTML resilience report.

SKILL.md

---
description: Audit codebase resilience and generate an HTML fragility report
allowed-tools: Bash, Read, Write, Glob, Grep
---

# Chaos Auditor

Systematically identify fragile code paths that would break under adverse conditions: missing error handling, unsafe environment variable access, untested critical paths, and single points of failure. Generate an HTML report with findings and severity ratings.

## Steps

1. **Scan error handling patterns.** Search the codebase for:
   - `try/catch` blocks — check if catch blocks are empty or only log without rethrowing or handling
   - `.catch(` on promises — check for swallowed errors
   - Async calls whose promise is neither awaited inside `try/catch` nor chained with `.catch(`: these can surface as unhandled promise rejections
   - Callback-style error handling — check if error parameter is actually used
   - `throw` statements — ensure thrown errors are caught somewhere in the call chain

2. **Audit environment variables.** Find all `process.env.*` or `import.meta.env.*` usage:
   - Flag any usage without a default value or fallback (e.g., `process.env.DB_HOST` without `??` or `||`)
   - Check if there is a validation layer (like `envalid` or a Zod schema) that catches missing vars at startup
   - Cross-reference with `.env.example` to find undocumented variables
   - Flag env vars used in critical paths (database connections, auth secrets) vs. non-critical (feature flags, log levels)

3. **Map test coverage.** For each source file in `src/`:
   - Check if a corresponding test file exists (`*.test.*`, `*.spec.*`, or file in `__tests__/`)
   - For files without tests, check the function length — functions over 50 lines with no tests are high-risk
   - Identify critical-path files (auth, payment, database) that lack tests

4. **Detect single points of failure.**
   - Find modules imported by >10 other files — a bug here cascades everywhere
   - Find functions with no type safety on inputs (`any`-typed parameters)
   - Find hardcoded URLs, IPs, or connection strings (instead of config)
   - Find retry-less network calls (`fetch`, `axios`, database queries without retry logic)

5. **Check graceful degradation patterns.**
   - Are there circuit breakers or timeout wrappers around external service calls?
   - Is there fallback behavior when a dependency is unavailable?
   - Are there health check endpoints?

6. **Generate HTML report.** Create `chaos-audit-report.html` with:
   - Executive summary with overall resilience score (0-100)
   - Five sections matching the audit areas above (steps 1-5)
   - Each finding has: file location, code snippet, severity (Critical/High/Medium/Low), remediation suggestion
   - Color-coded severity badges
   - Collapsible sections for large finding lists
   - Summary statistics and a donut chart showing finding distribution by severity

7. **Open the report.** Run `open chaos-audit-report.html` on macOS; use `xdg-open` on Linux or `start` on Windows.
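
The pattern checks in steps 1 and 2 can be approximated with plain regular expressions before any finding is verified by reading the surrounding code. The following is a minimal sketch, not the skill's actual implementation: `RawFinding`, `lineOf`, and `scanSource` are hypothetical names, and the regexes are heuristics that only cover `try/catch` blocks and direct `process.env.NAME` access.

```typescript
// Hypothetical finding shape; field names are assumptions, not defined by the skill.
interface RawFinding {
  kind: "empty-catch" | "env-without-fallback";
  line: number; // 1-based line where the match starts
  snippet: string;
}

// Convert a character offset into a 1-based line number.
function lineOf(source: string, index: number): number {
  return source.slice(0, index).split("\n").length;
}

function scanSource(source: string): RawFinding[] {
  const findings: RawFinding[] = [];
  // `catch (e) {}` or `catch {}` whose body contains only whitespace.
  const emptyCatch = /catch\s*(?:\([^)]*\))?\s*\{\s*\}/g;
  // `process.env.NAME`, optionally followed by a `??` or `||` fallback.
  const envAccess = /process\.env\.([A-Za-z0-9_]+)\s*(\?\?|\|\|)?/g;

  let match: RegExpExecArray | null;
  while ((match = emptyCatch.exec(source)) !== null) {
    findings.push({
      kind: "empty-catch",
      line: lineOf(source, match.index),
      snippet: match[0],
    });
  }
  while ((match = envAccess.exec(source)) !== null) {
    // Group 2 is undefined when no fallback operator follows the access.
    if (match[2] === undefined) {
      findings.push({
        kind: "env-without-fallback",
        line: lineOf(source, match.index),
        snippet: match[0].trim(),
      });
    }
  }
  return findings;
}
```

A regex pass like this produces candidates only. It will miss destructured access such as `const { DB_HOST } = process.env` and cannot tell whether a validation layer exists elsewhere, which is why the verification rule below still applies.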

## Severity rating guide

- **Critical**: Unhandled errors in auth/payment paths, missing env vars for database/secrets with no startup validation
- **High**: Untested functions >50 lines in business logic, retry-less calls to external services
- **Medium**: Empty catch blocks, env vars without defaults in non-critical paths
- **Low**: Missing tests for utility functions, hardcoded non-secret values

## Rules

- Do not flag patterns that are intentionally lenient (e.g., optional feature flags).
- Always verify findings — read the surrounding code to confirm the issue is real before including it.
- Sort findings by severity, then by file path.
- Include a "Quick Wins" section at the top with the 5 easiest-to-fix high-impact findings.
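
The ordering and "Quick Wins" rules can be sketched as follows. This is an illustration under assumptions: the `Finding` shape and the numeric `effort` estimate (1 = trivial, 5 = hard) are hypothetical, since the skill does not define how "easiest-to-fix" is measured, and "high-impact" is interpreted here as Critical or High severity.

```typescript
type Severity = "Critical" | "High" | "Medium" | "Low";

// Hypothetical finding record; `effort` is an assumed 1 (trivial) to 5 (hard) estimate.
interface Finding {
  file: string;
  severity: Severity;
  effort: number;
  summary: string;
}

const SEVERITY_RANK: Record<Severity, number> = {
  Critical: 0,
  High: 1,
  Medium: 2,
  Low: 3,
};

// Sort by severity (most severe first), then by file path. Returns a new array.
function sortFindings(findings: Finding[]): Finding[] {
  return findings.slice().sort(
    (a, b) =>
      SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity] ||
      a.file.localeCompare(b.file)
  );
}

// Quick wins: Critical/High findings ordered by lowest effort, capped at `limit`.
// Array.prototype.sort is stable, so ties keep the severity-then-path order.
function quickWins(findings: Finding[], limit = 5): Finding[] {
  return sortFindings(findings)
    .filter((f) => SEVERITY_RANK[f.severity] <= SEVERITY_RANK.High)
    .sort((a, b) => a.effort - b.effort)
    .slice(0, limit);
}
```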

How It Works

This skill applies chaos engineering principles without actually breaking anything. Instead of injecting failures into a running system (which requires infrastructure), it statically analyzes the code to predict where failures *would* cascade. It answers the question: "If this external service went down right now, what would happen?"
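
One way to make "where would failures cascade" concrete is import fan-in: the more files depend on a module, the wider the blast radius of a bug in it. The sketch below assumes an import graph has already been extracted (for example by grepping import statements); `ImportGraph`, `fanIn`, and `singlePointsOfFailure` are hypothetical names, and the default threshold of 10 mirrors step 4 of the skill.

```typescript
// Maps each file path to the module paths it imports.
// Building this map is assumed to happen elsewhere.
type ImportGraph = Record<string, string[]>;

// Count how many distinct files import each module.
function fanIn(graph: ImportGraph): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const importer of Object.keys(graph)) {
    // De-duplicate so a file importing the same module twice counts once.
    const unique = graph[importer].filter((m, i, all) => all.indexOf(m) === i);
    for (const imported of unique) {
      counts[imported] = (counts[imported] ?? 0) + 1;
    }
  }
  return counts;
}

// Modules imported by more than `threshold` files, in alphabetical order.
function singlePointsOfFailure(graph: ImportGraph, threshold = 10): string[] {
  const counts = fanIn(graph);
  return Object.keys(counts)
    .filter((m) => counts[m] > threshold)
    .sort();
}
```

Fan-in is only a proxy. A heavily imported module of pure constants is less risky than a database client with the same count, so each candidate still needs a read of the code.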

The env var audit is deceptively powerful. In many Node.js applications, a missing environment variable does not crash the process at startup. The `process.env` lookup silently evaluates to `undefined` and causes cryptic failures later. By cross-referencing `process.env` usage with `.env.example` and checking for validation layers, this skill catches configuration landmines before they detonate in production.
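
For illustration, the kind of startup validation layer the audit looks for can be as small as the dependency-free sketch below. It is not the API of `envalid` or Zod; `Env`, `requireEnv`, and the variable names `DB_HOST` and `JWT_SECRET` are hypothetical.

```typescript
// An env-like object, structurally compatible with Node's `process.env`.
type Env = Record<string, string | undefined>;

// Fail fast at startup, listing every missing or empty variable at once,
// instead of letting `undefined` leak into later code.
function requireEnv<K extends string>(
  env: Env,
  names: readonly K[]
): Record<K, string> {
  const missing = names.filter(
    (name) => env[name] === undefined || env[name] === ""
  );
  if (missing.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missing.join(", ")}`
    );
  }
  const values = {} as Record<K, string>;
  for (const name of names) {
    values[name] = env[name] as string; // checked above
  }
  return values;
}

// Assumed usage at application startup:
// const config = requireEnv(process.env, ["DB_HOST", "JWT_SECRET"]);
```

With a layer like this in place, a missing `DB_HOST` stops the process immediately with a readable message, and the audit can downgrade individual `process.env` accesses that go through it.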

The HTML report format with severity ratings and remediation suggestions transforms a daunting audit into an actionable checklist. The "Quick Wins" section is a psychological nudge — by surfacing easy fixes first, it builds momentum and makes the overall resilience improvement feel achievable rather than overwhelming.