Music Graph Project: CI/CD and Production DevOps
Over the past few days we built out CI/CD and a dev VM. I used Claude CLI for the first time, and it worked out better than I thought it would. At first I was taken aback because it wanted to write all the code and just ask me if it could write the file. I stopped it and asked it to update CLAUDE.md to say that I want to understand and step into the code before we write anything. It started doing planning sessions with me after that. I also had to fix some of the Terraform it wrote. Below is the unedited post by Claude CLI, using Opus this time.
Phase 7 transforms the Music Graph project from a manually-deployed application into a production-ready system with automated testing, deployment pipelines, database backups, and reliable SSL certificate management. This phase focused on the infrastructure and automation needed for sustainable long-term operation.
Starting Point: Manual Everything
After Phase 6, the application worked well but had operational gaps:
- Manual deployments: SSH to VMs, git pull, docker-compose rebuild
- No automated testing: Changes went straight to production
- Single environment: Testing in production (risky)
- No backups: Database lived only in Docker containers
- SSL certificate issues: Certbot renewal failing with IP-restricted firewall
- No CI/CD: Every deployment required manual steps
For a learning project with one user (Aidan), this was acceptable. For a production application, it needed improvement.
The Build: Five Major Improvements
1. GitHub Actions CI/CD Pipeline
The Problem: No automated testing or deployment workflow.
The Solution: Three GitHub Actions workflows:
CI Workflow (ci.yml), sketched below:
- Triggers on pull requests and pushes to main
- Runs pytest test suite (17 tests)
- Checks code coverage
- Runs flake8 linting
- Path filters: Only runs when Python or test files change
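A minimal sketch of what a workflow like this might look like (the Python version, job name, and exact paths are assumptions, not the project's actual file):

```yaml
# Hypothetical ci.yml sketch based on the description above
name: CI
on:
  pull_request:
  push:
    branches: [ main ]
    paths: [ '**.py', 'tests/**', 'requirements.txt' ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12' # version assumed
      - run: pip install -r requirements.txt pytest pytest-cov flake8
      - run: flake8 .
      - run: pytest --cov=. # runs the test suite with coverage
```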
Dev Deployment (deploy-dev.yml), sketched below:
- Triggers on push to main branch
- Builds Docker image tagged as `dev-latest`
- Pushes to Google Container Registry
- SSH to dev VM and pulls new image
- Restarts containers automatically
- Path filters: Only deploys when application code changes
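A rough sketch of this workflow's shape (secret names and action versions are assumptions; the real file lives in .github/workflows/):

```yaml
# Hypothetical deploy-dev.yml sketch; secret names and action versions assumed
name: Deploy Dev
on:
  push:
    branches: [ main ]
    paths: [ '**.py', 'templates/**', 'static/**', 'requirements.txt' ]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Authenticate Docker to GCR with a service-account JSON key (secret name assumed)
      - uses: docker/login-action@v3
        with:
          registry: gcr.io
          username: _json_key
          password: ${{ secrets.GCP_SA_KEY }}
      - run: |
          docker build -t gcr.io/music-graph-479719/music-graph:dev-latest .
          docker push gcr.io/music-graph-479719/music-graph:dev-latest
      - name: Restart containers on the dev VM
        uses: appleboy/ssh-action@v1 # version assumed
        with:
          host: ${{ secrets.DEV_VM_HOST }}
          username: ${{ secrets.DEV_VM_USER }}
          key: ${{ secrets.DEV_SSH_KEY }}
          script: |
            cd ~/music-graph
            docker compose pull   # assumes the compose file references the dev-latest image
            docker compose up -d
```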
Prod Deployment (deploy-prod.yml):
- Manual trigger only (workflow_dispatch)
- Builds image tagged as `production` and `latest`
- Creates GitHub release with version tag
- Requires manual approval for safety
Path Filters Implementation:
```yaml
on:
  push:
    branches: [ main ]
    paths:
      - 'app.py'
      - 'models.py'
      - '**.py'
      - 'templates/**'
      - 'static/**'
      - 'requirements.txt'
```
This prevents unnecessary builds when only documentation or configuration changes. Don’t bake a new cake if the recipe didn’t change.
2. Separate Development Environment
Created a complete parallel environment for safe testing:
Dev Environment:
- Separate GCP VM (`dev-music-graph`)
- Own PostgreSQL database
- Subdomain: `dev.music-graph.billgrant.io`
- Own SSL certificate
- Pulls `dev-latest` Docker images
Terraform Workspace Strategy:
```bash
terraform workspace select dev
terraform apply   # Creates dev infrastructure

terraform workspace select prod
terraform apply   # Creates prod infrastructure
```
Each workspace maintains separate state for VM configuration, firewall rules, and environment-specific settings.
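For illustration, a trimmed sketch of how the active workspace can drive per-environment settings (the machine sizes and zone here are assumptions):

```hcl
# Hypothetical sketch: the workspace name selects naming and sizing
locals {
  env = terraform.workspace # "dev" or "prod"
}

resource "google_compute_instance" "app" {
  name         = "${local.env}-music-graph"
  machine_type = local.env == "prod" ? "e2-small" : "e2-micro" # sizes assumed
  zone         = "us-central1-a"                               # zone assumed

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
    access_config {} # ephemeral external IP
  }
}
```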
Benefits:
- Test changes safely before production
- Parallel development possible
- Environment parity (both use Docker, PostgreSQL, same setup)
- No production impact from experiments
3. CI/CD Optimization (Issue #8)
After initial CI/CD setup, we optimized to reduce costs and complexity:
Image Tagging Strategy:
- Dev builds: `gcr.io/music-graph-479719/music-graph:dev-latest`
- Prod builds: `gcr.io/music-graph-479719/music-graph:production` and `:latest`
- Releases: `gcr.io/music-graph-479719/music-graph:v1.0.0`
This prevents dev and prod from colliding on the `latest` tag.
GCR Lifecycle Policy:
Created via Terraform to automatically clean up old images:
resource "google_artifact_registry_repository" "music_graph" {
cleanup_policies {
id = "delete-old-dev-images"
action = "DELETE"
condition {
tag_prefixes = ["dev-"]
older_than = "2592000s" # 30 days
}
}
cleanup_policies {
id = "keep-production-releases"
action = "KEEP"
condition {
tag_prefixes = ["v", "production", "latest"]
}
}
}
Docker Cleanup Automation:
Both deployment workflows now clean up old images on VMs:
```bash
docker system prune -af --filter "until=24h"
```
This prevents disk space issues that had previously filled both VMs to 100% capacity.
Terraform Restructure:
Split Terraform into logical directories:
```
terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   └── terraform.tfvars
│   └── prod/
│       ├── main.tf
│       └── terraform.tfvars
└── project/
    ├── main.tf            # GCS bucket, GCR policy, IAM
    └── terraform.tfvars
```
- environments/: Workspace-specific VM configurations
- project/: Shared resources (GCS buckets, GCR policies, IAM)
This prevents folder sprawl while maintaining clear separation.
4. Database Backup System
The Problem: Database only existed in Docker containers. No backups, no disaster recovery.
The Solution: Automated daily backups to Google Cloud Storage.
Components:
1. GCS Bucket (Terraform):
resource "google_storage_bucket" "database_backups" {
name = "music-graph-backups-music-graph-479719"
location = "US" # Multi-region for disaster recovery
lifecycle_rule {
action {
type = "Delete"
}
condition {
age = 7 # Delete backups older than 7 days
}
}
}
2. Backup Script (backup-database.sh), sketched after this list:
- Runs pg_dump from the Docker container
- Uses `--clean --if-exists` flags for restore compatibility
- Compresses with gzip (5 MB → 500 KB)
- Uploads to GCS
- Logs all operations
- Cleans up local files
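Putting those pieces together, a minimal sketch of such a script (the container, database, and user names are assumptions):

```bash
#!/usr/bin/env bash
# Hypothetical backup-database.sh sketch; container/database/user names assumed
set -euo pipefail

ENV="${1:?usage: backup-database.sh <env>}"
STAMP="$(date +%Y%m%d-%H%M%S)"
FILE="music-graph-${ENV}-${STAMP}.sql.gz"

# Dump from the running Postgres container; --clean --if-exists makes restores repeatable
docker exec music-graph-db pg_dump -U postgres --clean --if-exists music_graph \
  | gzip > "/tmp/${FILE}"

# Upload to GCS, log the result, then clean up the local copy
gsutil cp "/tmp/${FILE}" "gs://music-graph-backups-music-graph-479719/${ENV}/"
echo "$(date -Is) uploaded ${FILE}" >> /var/log/music-graph-backup.log
rm -f "/tmp/${FILE}"
```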
3. Cron Job (each VM):
```bash
# Daily database backup at 3:00 AM
0 3 * * * /home/billgrant/music-graph/backup-database.sh prod >> /var/log/music-graph-backup-cron.log 2>&1
```
4. Verification Script (verify-backup-setup.sh):
Checks 9 aspects of backup configuration, a few of which are sketched below:
- Backup script executable
- Log file permissions
- Docker group membership
- Database container running
- GCS bucket access
- Cron job configured
- And more…
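A few of those checks might look roughly like this (paths and bucket name taken from this post):

```bash
# Sample verification checks; the real script covers nine of these
[ -x /home/billgrant/music-graph/backup-database.sh ] && echo "OK: backup script executable"
crontab -l | grep -q backup-database.sh && echo "OK: cron job configured"
gsutil ls gs://music-graph-backups-music-graph-479719/ > /dev/null && echo "OK: GCS bucket reachable"
```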
5. Documentation (docs/database-backup-setup.md):
Complete setup guide including:
- One-time VM setup steps
- Restore procedure
- Troubleshooting
- Tested restore results
Restore Testing:
We tested the full disaster recovery procedure on dev:
- Created backup
- Deleted data (simulated data loss)
- Downloaded backup from GCS
- Restored to database
- Verified data recovery
Result: Complete data recovery in under 2 minutes. The `--clean --if-exists` flags proved essential: they allow restoring over an existing database without conflicts.
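The restore itself can be as short as this sketch (container and database names assumed, as above):

```bash
# Hypothetical restore walk-through
LATEST="$(gsutil ls gs://music-graph-backups-music-graph-479719/dev/ | tail -n 1)"
gsutil cp "${LATEST}" /tmp/restore.sql.gz

docker stop music-graph-web # stop the app so nothing writes mid-restore
gunzip -c /tmp/restore.sql.gz | docker exec -i music-graph-db psql -U postgres music_graph
docker start music-graph-web
```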
5. SSL Certificate Auto-Renewal Fix
The Problem: Certbot renewal was failing because our firewall blocks port 80, so the HTTP-01 challenge can't complete.
The Solution: Switch to Route53 DNS-01 challenge.
How DNS-01 Works: Instead of proving domain ownership via HTTP (port 80), certbot creates a DNS TXT record in Route53. Let’s Encrypt checks the DNS record to verify ownership. No firewall ports required.
Implementation:
1. AWS IAM Resources (Terraform):
resource "aws_iam_policy" "certbot_route53" {
name = "certbot-route53-dns-music-graph"
policy = jsonencode({
Statement = [
{
Effect = "Allow"
Action = ["route53:ListHostedZones", "route53:GetChange"]
Resource = "*"
},
{
Effect = "Allow"
Action = "route53:ChangeResourceRecordSets"
Resource = "arn:aws:route53:::hostedzone/*"
}
]
})
}
resource "aws_iam_user" "certbot" {
name = "certbot-music-graph"
}
resource "aws_iam_access_key" "certbot" {
user = aws_iam_user.certbot.name
}
Added AWS provider to existing Terraform project configuration (already had Google provider).
2. Configure AWS Credentials (both VMs):
```bash
# Create credentials file for root (certbot runs as root)
sudo mkdir -p /root/.aws
sudo bash -c 'cat > /root/.aws/credentials << EOF
[default]
aws_access_key_id = <from terraform output>
aws_secret_access_key = <from terraform output>
EOF'
sudo chmod 600 /root/.aws/credentials
```
3. Install Plugin and Get Certificates:
```bash
sudo apt install python3-certbot-dns-route53

# Get certificate using DNS-01
sudo certbot certonly \
  --dns-route53 \
  -d music-graph.billgrant.io \
  --non-interactive \
  --agree-tos \
  -m email@example.com
```
4. Test Renewal:
```bash
sudo certbot renew --dry-run
# ✓ Congratulations, all simulated renewals succeeded
```
Why This Matters:
The firewall can stay locked down (only port 443 open) while certificates still renew automatically. This is more secure than opening port 80 for HTTP-01 challenges.
Lessons Learned
1. Path Filters Save Money and Time
Initially, every push to main triggered full build and deployment - even for documentation changes. Path filters ensure we only build when code actually changes.
Before: 10 deployments per day, even for typo fixes.
After: 2-3 deployments per day, for actual code changes.
2. Immutable Infrastructure Is Next
We’re still modifying VMs in place (SSH, apt install, manual changes). Phase 8 will address this with Packer images and immutable infrastructure.
Current state:
- Manual setup on each VM
- “Snowflake” servers (unique configurations)
- Hard to reproduce
Phase 8 goal:
- Packer builds VM images with all dependencies
- VMs deployed from images (no manual setup)
- Reproducible infrastructure
3. Backups Are Useless Until Tested
We built the backup system and assumed it worked. Then we tested restore and found issues:
- Missing `--clean --if-exists` flags caused duplicate key errors
- Needed to stop the web container during restore
- Documentation gaps
Testing the restore procedure caught all these issues before a real disaster.
4. Multi-Cloud Is Reality
This project now uses:
- Google Cloud: VMs, Container Registry, Cloud Storage
- AWS: Route53 DNS, IAM for certbot
- GitHub: Source control, Actions, Container Registry alternative
Terraform manages both GCP and AWS resources in the same configuration. Multi-cloud isn't a choice - it's the reality of using the best tool for each job.
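Concretely, both providers sit in the same root module; a trimmed sketch (the AWS region is an assumption):

```hcl
terraform {
  required_providers {
    google = { source = "hashicorp/google" }
    aws    = { source = "hashicorp/aws" }
  }
}

provider "google" {
  project = "music-graph-479719"
}

provider "aws" {
  region = "us-east-1" # region assumed; Route53 itself is global
}
```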
5. Planning Sessions Are Worth It
Issue #8 (CI/CD optimization) had the best planning session yet:
- High-level concepts before code
- Discussion of trade-offs
- Caught issues early (latest tag collision, workspace concerns)
- Clear implementation plan before touching files
Quote from CLAUDE.md:
“This was a great planning session, best one yet. The way we did this is the way I am most comfortable with.”
Starting with understanding beats diving into code.
Workflow Evolution
This phase refined the development workflow documented in CLAUDE.md:
Code Change Workflow:
- Start with high-level explanation
- Let Bill ask questions (this is a learning project)
- Discuss trade-offs and alternatives
- Address concerns (Bill often catches important issues)
- Get explicit approval
- Then implement
For Refactoring Existing Code:
- Step-by-step, methodical changes
- Test after each change
- Use feature branches
- Get approval before applying
For New Features:
- Can do full-file generation
- Still review before applying
- Test locally before production
Current State
The application is now production-ready from an operations perspective:
✅ Automated Testing: 17 tests run on every PR
✅ Automated Deployment: Push to main deploys to dev automatically
✅ Safe Promotion: Manual approval required for production
✅ Database Backups: Daily automated backups, tested restore procedure
✅ SSL Auto-Renewal: Certificates renew automatically via DNS-01
✅ Cost Optimization: Old images cleaned up automatically
✅ Separate Environments: Dev and prod fully isolated
Live Sites:
- Production: https://music-graph.billgrant.io
- Development: https://dev.music-graph.billgrant.io
Both environments run the same Docker containers, use the same database schema, and follow the same deployment process.
What’s Next: Phase 8
The roadmap is shifting based on lessons learned. Phase 8 will focus on Infrastructure Modernization (Issue #10):
Priorities:
- Remote Terraform State: Move from local state files to GCS backend (enables team collaboration, state locking; sketched after this list)
- Production WSGI Server: Replace Flask dev server with Gunicorn (Issue #2)
- Immutable Infrastructure: Packer images, never modify running VMs (Issue #7)
- Secrets Management: Remove static secrets from repository (Issue #12 - blocking for public launch)
- Evaluate Cloud SQL: Compare managed PostgreSQL vs Docker database
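For the remote-state item above, the backend block would look something like this sketch (bucket name assumed; the GCS backend provides state locking out of the box):

```hcl
terraform {
  backend "gcs" {
    bucket = "music-graph-tfstate" # bucket name assumed
    prefix = "env/prod"
  }
}
```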
Future Sub-Issues:
- Issue #11: Import DNS records to Terraform (everything as code)
- Issue #12: Remove static secrets (critical before public launch)
Phase 8 will address the “mutable infrastructure” problem - VMs are still modified manually. The goal is reproducible infrastructure from code.
Code
Changes visible in:
- GitHub Issue #8 - CI/CD Optimization
- CI/CD workflows in `.github/workflows/`
- Backup system: `backup-database.sh`, `verify-backup-setup.sh`, `docs/database-backup-setup.md`
- Terraform changes: `terraform/project/main.tf` (AWS provider, IAM, GCS bucket, GCR policy)
This is part of the Music Genre Graph project series. See the project introduction for the full roadmap.