Operations & Troubleshooting

Maintaining platform reliability requires robust observability, standardized rollback procedures, and a systematic approach to technical troubleshooting. This guide outlines the engineering standards for production operations.

Health Checks & Liveness

Every service must expose standardized health endpoints to facilitate orchestration (e.g., Docker Compose, Kubernetes) and external monitoring.

Standard Endpoints

| Endpoint | Purpose | Logic |
| --- | --- | --- |
| /health | Liveness | Verifies the process is running (fast, no external dependencies). |
| /health/ready | Readiness | Verifies the service can handle traffic (checks DB, cache, etc.). |
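
The split between the two endpoints can be sketched as follows. This is a minimal illustration using only the Python standard library, not a prescribed implementation: the probe functions `check_database` and `check_cache`, the JSON response shape, and the choice of HTTP 503 for "not ready" are all assumptions made for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical dependency probes; real ones would ping the DB, cache, etc.
def check_database() -> bool:
    return True

def check_cache() -> bool:
    return True

READINESS_CHECKS: Dict[str, Callable[[], bool]] = {
    "database": check_database,
    "cache": check_cache,
}

def liveness() -> Tuple[int, dict]:
    # Liveness touches no dependencies: if this code runs, the process is alive.
    return 200, {"status": "ok"}

def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    # Run every probe; a probe that raises counts as a failure.
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    ready = all(results.values())
    status = "ready" if ready else "not_ready"
    return (200 if ready else 503), {"status": status, "checks": results}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/health":
            code, body = liveness()
        elif self.path == "/health/ready":
            code, body = readiness(READINESS_CHECKS)
        else:
            code, body = 404, {"error": "not found"}
        payload = json.dumps(body).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def serve(port: int = 3000) -> None:
    # Port 3000 is an assumption taken from the health check URL in this guide.
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the listener. Keeping `liveness()` free of dependency checks means a slow database cannot cause the orchestrator to restart an otherwise healthy process, while `readiness()` can still take the instance out of rotation.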

Docker Compose Implementation

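Define health checks in docker-compose.yml, and gate dependent services with depends_on and condition: service_healthy so they start only after the service they rely on reports healthy. A healthcheck on its own only records container health; it does not delay other services. The curl binary must be present in the image for the check below to work.

services:
  api:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 30s
  # Hypothetical dependent service, shown only to illustrate gating on health
  worker:
    depends_on:
      api:
        condition: service_healthy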

Logging Strategy

We standardize on Structured Logging (JSON) in production to enable efficient log aggregation, programmatic searching, and automated alerting.
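
As an illustration of what a structured log line might look like, the sketch below emits one JSON object per record using Python's standard `logging` module. The field names (`timestamp`, `level`, `logger`, `message`, `context`), the logger name `api`, and the sample order ID are assumptions for the example, not a schema defined by this guide.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} land on the record as attributes.
        context = getattr(record, "context", None)
        if context:
            entry["context"] = context
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry, default=str)

def build_logger(name: str = "api") -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        # StreamHandler writes to stderr by default, which the container runtime captures.
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.propagate = False
    return logger

logger = build_logger()
logger.error("payment failed", extra={"context": {"order_id": "hypothetical-123"}})
```

Because every line is a complete JSON object, an aggregator can filter on `level` or on a `context` field directly instead of pattern-matching free text.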

Operations Commands

| Objective | Command |
| --- | --- |
| Follow Logs | docker compose logs -f <service> |
| Search Errors | docker compose logs \| grep -i error |
| Time-based View | docker compose logs --since 15m <service> |

Log Retention

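In production environments, configure the json-file logging driver with max-size and max-file limits to prevent disk exhaustion. The driver rotates a log file once it reaches max-size and keeps at most max-file files per container, so a setting such as max-size: 10m with max-file: 3 caps each container at roughly 30 MB of logs on disk. In docker-compose.yml, both options go under the service's logging.options key.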


Rollback Procedures

When a deployment failure occurs, prioritize returning the system to a "Known Good" state immediately.

Service Rollback

Redeploy the previous stable Docker image tag from the container registry.

# Pull the specific known-good version
docker pull registry.example.com/api:prod-a1b2c3d

# After pointing the api service's image at that tag in docker-compose.yml,
# recreate the container with the stable image
docker compose up -d --force-recreate api
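
After the container is recreated, it can help to confirm that the service reports ready before declaring the rollback complete. The following is a minimal sketch of such a check, assuming the api service publishes port 3000 on the host; the function names, attempt count, and delay are illustrative choices.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

def probe_ready(url: str, timeout: float = 3.0) -> bool:
    """Return True when the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and 4xx/5xx responses all mean "not ready yet".
        return False

def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 10,
    delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay)
    return False

# Example (the URL is an assumption based on the endpoints in this guide):
# ok = wait_until_ready(lambda: probe_ready("http://localhost:3000/health/ready"))
```

Passing the probe and the sleep function as parameters keeps the retry loop easy to exercise without a running service.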

Database Rollback

If a database migration is destructive or causes application failures, restore from the pre-migration backup.

# Restore the database from the pre-migration backup
./scripts/restore-db.sh ./backups/pre_migration_backup.sql.gz

Troubleshooting Guide

Docker Infrastructure

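| Symptom | Potential Cause | Remediation |
| --- | --- | --- |
| Port Collision | Host process occupying the port | Run lsof -i :<PORT> and terminate the conflicting process. |
| Connection Refused | Service initialization delay | Check container status and health with docker compose ps, then review the service logs. |
| DNS Resolution Failure | Network driver glitch | Run docker compose down followed by docker compose up -d to recreate the project network; docker compose restart alone restarts containers and leaves the network untouched. |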

Development Environment (Dev Containers)

| Symptom | Remediation |
| --- | --- |
| Slow Container Startup | Use pre-built base images or minimize postCreateCommand logic. |
| Permission Denied | Verify updateRemoteUserUID: true is set in devcontainer.json. |
| SSH Authentication Failure | Execute ssh-add -l on the host to ensure keys are available to the agent. |

Production Release Checklist

Verify all items before executing a production deployment:

  • Data Integrity: Verified and recent database backups are accessible.
  • Observability: CPU, Memory, and Error Rate alerts are active.
  • Security: No secrets or environment files (such as .env) are tracked in version control.
  • Validation: The exact image digest to be deployed has been validated in the Staging environment.

Support

For critical production incidents, follow the internal on-call rotation protocol.