SaaS platform that automates the creation and management of deployment checklists, GitHub Actions configurations, branch protection rules, and environments, using predefined templates and automatic validation.
Development teams have to manage complex deployment checklists and CI/CD configurations manually, an error-prone process.
Development teams, DevOps engineers, and CTOs at startups and mid-sized companies deploying web applications.
"Pre-deployment checklist — before first production deploy PR"
Pre-deployment checklist — before first production deploy PR
Repository: Salk-Harnessing-Plants-Initiative/bloom
Author: blm3886

# Pre-deployment checklist — Bloom v2 first production deploy

Everything that must be verified / configured before the first production deploy.

- **GitHub Actions → server**: self-hosted runner installed on the Salk server.
- **TLS certs**: `tls internal` (Caddy generates its own CA locally, auto-renewing) for the first deploy; swap for a real cert from Salk IT once obtained.

---

## Operating policy

**Migrations and deploys to prod and staging are applied EXCLUSIVELY through the GitHub Actions deploy workflow.**

Do NOT manually run `supabase db push`, `docker compose up`, or any deploy operation over SSH on the server except in a documented emergency recovery scenario. Any emergency manual operation must be logged on an incident ticket so the state difference between the workflow's expectation and reality is visible after the fact.

**Why**: every code change that reaches prod should have a git SHA, PR review, audit trail, and rollback path. Manual operations bypass all four.
---

## Phase 1 — GitHub repository configuration

### Environments (Settings → Environments)

- [x] `production` environment created:
  - [x] Required reviewers: at least 1 (Settings → Environments → production → Deployment protection rules → Required reviewers → Add reviewer)
  - [x] Deployment branch rule: `main` only
  - [x] Wait timer: 5 minutes (recommended)
- [x] `staging` environment created:
  - [x] Deployment branch rule: `staging` only
  - [x] No required reviewers (staging is intended for iteration)

### Branch protection on `main` (Settings → Branches → Add branch protection rule)

**Branch name pattern:**
- [x] Pattern set to `main`

**Require a pull request before merging:**
- [x] Main checkbox ticked
- [x] Required approving reviews: **1**
- [x] Dismiss stale pull request approvals when new commits are pushed — ticked (recommended)

**Require status checks to pass before merging:**
- [x] Main checkbox ticked
- [x] Require branches to be up to date before merging — ticked
- [x] Following status checks added to the required list.
  Search by **display name** (what appears in the Actions UI):
  - [x] `Build & npm CVE Audit` (internal job id: `build-and-audit`)
  - [x] `Verify prod/staging env parity` (`verify-env-parity`)
  - [ ] `Validate committed env defaults` (`validate-env-defaults`) — added in PR #144
  - [x] `Python Security Audit for CVEs` (`python-audit`)
  - [x] `Analyze (actions)` (CodeQL)
  - [x] `Analyze (javascript-typescript)` (CodeQL)
  - [x] `Analyze (python)` (CodeQL)

**Deferred status checks — add AFTER their blocker merges:**
- [ ] `Docker Compose Health Check` (`compose-health-check`) — add after PR #152 merges (removes `continue-on-error: true` so failures actually block merge)
- [ ] `lint-migrations` — add after PR #146 merges (that PR introduces the job; requiring it now would permanently block all PRs)

**Additional protections:**
- [ ] Require conversation resolution before merging — ticked (recommended)
- [ ] Require signed commits — optional (nice-to-have for audit trail)
- [ ] Require linear history — optional (if your team prefers rebase-merge workflows)
- [ ] Include administrators — ticked (so `main` is protected even from org admins)
- [ ] Restrict who can push to matching branches — ticked, bypass list **empty** (nobody can push directly; everyone must PR)

**Destructive-action protections (both MUST be unticked):**
- [x] Allow force pushes — **UNTICKED** (force-pushing `main` would rewrite history; destructive)
- [x] Allow deletions — **UNTICKED** (deleting `main` would be catastrophic)

**After clicking Create / Save changes:**
- [x] Confirm the Branch protection rules page now shows one row with pattern `main` and "✓ Active" status

### Secrets (Settings → Secrets and variables → Actions)

**Updated after PR #144 (env-config refactor).** Non-sensitive config (domains, ports, URLs, flags) now lives in committed `.env.prod.defaults` and `.env.staging.defaults` files, no longer in GitHub Secrets. Only true credentials and infrastructure paths are kept here.
Current state: **46 secrets in store (cap 100), 1 remaining to add** (`DEPLOY_HOST_KEY`).

**Deploy access (4 total, 3 done, 1 gate):**
- [x] `DEPLOY_HOST` — server address
- [x] `DEPLOY_USER` — SSH user (`bloom-deploy`)
- [x] **`DEPLOY_SSH_KEY`** — private SSH key for the deploy user (note: the workflow reads `secrets.DEPLOY_SSH_KEY`, not `DEPLOY_KEY`)
- [ ] **`DEPLOY_HOST_KEY`** — server's SSH host public key. **This is the deploy arm-switch.** Adding it completes the chain that lets the next push-to-main or `workflow_dispatch` run a real deploy. Add only when ready. Capture via `ssh-keyscan -t ed25519 <DEPLOY_HOST>` from a trusted machine (not the server itself); verify the fingerprint out-of-band with the server admin; then `gh secret set DEPLOY_HOST_KEY`.

**Deploy paths (4 total — all done):**
- [x] `PROD_DEPLOY_PATH` — `/data/bloom/production` (moved off `/opt`; root fs was 84% full)
- [x] `STAGING_DEPLOY_PATH` — `/data/bloom/staging`
- [x] `PROD_MINIO_DATA_PATH` — `/data/bloom/minio-data`
- [x] `STAGING_MINIO_DATA_PATH` — `/data/bloom/minio-staging`

**Production credentials (15 total — all done; values from `scripts/generate-secrets.sh prod`):**
- [x] `PROD_POSTGRES_PASSWORD`
- [x] `PROD_JWT_SECRET`, `PROD_ANON_KEY`, `PROD_SERVICE_ROLE_KEY`
- [x] `PROD_DB_ENC_KEY` — **must be exactly 16 ASCII chars** (AES-128 key size). `openssl rand -hex 8` → 16 hex chars = 16 ASCII bytes. Verify: `echo -n "$PROD_DB_ENC_KEY" | wc -c` returns 16. If it's 32, Realtime crashes ~90s into startup with a cryptic Elixir error.
- [x] `PROD_VAULT_ENC_KEY`, `PROD_SUPAVISOR_ENC_KEY`, `PROD_SECRET_KEY_BASE`
- [x] `PROD_MINIO_ROOT_USER` = `supabase`, `PROD_MINIO_PASSWORD` (admin credentials — username moved from defaults to secrets 2026-04-20; rotation to a unique per-env value tracked in #151)
- [x] `PROD_DASHBOARD_USERNAME` = `admin`, `PROD_DASHBOARD_PASSWORD` (Studio admin basic-auth — same move, same rotation follow-up)
- [x] `PROD_BLOOMMCP_API_KEY`
- [x] `PROD_OPENAI_API_KEY` — **real OpenAI key required.** If unset, agent chat calls fail (the stack still deploys healthy). Test after deploy: `curl /api/langchain/chat`.
- [x] `PROD_LANGCHAIN_API_KEY` — LangSmith tracing. The agent works without it (tracing no-ops); set it for observability.

**Staging credentials (15 total — 13 done, 2 placeholders):**
- [x] Same set as prod with the `STAGING_` prefix
- [ ] `STAGING_OPENAI_API_KEY` — currently `needs-to-be-updated`. Replace with a real key (or reuse prod's) when ready.
- [ ] `STAGING_LANGCHAIN_API_KEY` — currently `needs-to-be-updated`. Replace when a LangSmith project is set up for staging.

**CI credentials (9 total — all done, used by `compose-health-check` in `pr-checks.yml`):**
- [x] `CI_POSTGRES_PASSWORD`, `CI_JWT_SECRET`, `CI_ANON_KEY`, `CI_SERVICE_ROLE_KEY`
- [x] `CI_DB_ENC_KEY`, `CI_VAULT_ENC_KEY`, `CI_SUPAVISOR_ENC_KEY`, `CI_SECRET_KEY_BASE`
- [x] `CI_MINIO_PASSWORD`

**Non-sensitive config (NOT in GitHub Secrets):**
- All domains, URLs, ports, feature flags, pooler sizes, Postgres USER/HOST/PORT/DB, etc. live in `.env.prod.defaults` and `.env.staging.defaults` (committed to the repo). The deploy workflow concatenates defaults + secrets into `.env.{prod,staging}` on the runner at deploy time. Values are unit-tested against `docker-compose.prod.yml` references (see `tests/unit/test_env_defaults.py`) so CI catches drift.
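The defaults + secrets concatenation described above can be sketched as follows. This is a minimal sketch, not the actual `deploy.yml` step: the function name `assemble_env` and the two example variables are illustrative, and the real workflow injects many more values from GitHub Secrets.

```shell
#!/usr/bin/env bash
# Sketch: assemble an env file from committed defaults plus secret values.
# assemble_env DEFAULTS_FILE OUT_FILE expects the PROD_* values in the
# environment (the real workflow injects them from GitHub Secrets).
set -u

assemble_env() {
  local defaults="$1" out="$2"
  umask 077                                   # keep the resulting file 0600
  cat "$defaults" > "$out"                    # committed, non-sensitive config first
  {
    echo "POSTGRES_PASSWORD=${PROD_POSTGRES_PASSWORD}"
    echo "DB_ENC_KEY=${PROD_DB_ENC_KEY}"
  } >> "$out"                                 # secrets appended last
  # Checklist guard: DB_ENC_KEY must be exactly 16 ASCII bytes (AES-128).
  [ "$(printf %s "${PROD_DB_ENC_KEY}" | wc -c)" -eq 16 ] || {
    echo "DB_ENC_KEY must be exactly 16 bytes" >&2
    return 1
  }
}
```

The key-length guard encodes the Realtime failure mode noted above: catching a 32-byte key here fails the deploy immediately instead of ~90 seconds into container startup.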
**SMTP credentials for GoTrue (optional — only if you want the email-verify flow):**
- [ ] `PROD_SMTP_HOST`, `PROD_SMTP_PORT`, `PROD_SMTP_USER`, `PROD_SMTP_PASS` — set only if `ENABLE_EMAIL_AUTOCONFIRM=false`. If `true` (the current default in `.env.prod.defaults`), skip — but then email-verify signup is effectively disabled.

---

## Phase 2 — Server setup

### Base OS + Docker

- [x] Ubuntu 22.04+ with Docker + docker-compose v2 installed (running 24.04.4 LTS)
- [x] `bloom-deploy` user created, added to the `docker` group, **NOT root**
- [x] Deploy-user SSH public key added to `~bloom-deploy/.ssh/authorized_keys`
- [x] SSH-to-self works for `bloom-deploy@localhost` (the self-hosted runner does loopback SSH to the target):
  - [x] Local verify: `ssh bloom-deploy@localhost "whoami"` → `bloom-deploy`
  - [ ] Host key captured: `ssh-keyscan -t ed25519 localhost` → saved as `DEPLOY_HOST_KEY` secret
- [x] ~~/opt/bloom/production/~~ `/data/bloom/production/` created, owned by `bloom-deploy`, initial `git clone` done (moved to `/data` because the root fs is 84% full)
- [x] ~~/opt/bloom/staging/~~ `/data/bloom/staging/` created, owned by `bloom-deploy`, initial `git clone` done
- [x] MinIO **prod** data path `/data/bloom/minio-data` created, owned by `bloom-deploy`, writable (verified with `touch` + `rm`)
- [x] MinIO **staging** data path `/data/bloom/minio-staging` created + verified writable

### Firewall

**Inbound rules (UFW — our control)**
- [x] Port 80/tcp — restricted to `10.0.0.0/8` (Salk-internal)
- [x] ~~Port 443/tcp — currently `Anywhere`; tighten to match 80 once Salk IT confirms CIDR (pending Elizabeth/Salk IT)~~ → Done 2026-04-20. Fernando (Salk IT) approved `10.0.0.0/8` for 22 and 443. UFW rule changed: `ALLOW IN Anywhere` → `ALLOW IN 10.0.0.0/8` (both v4 and v6 wide-open rules removed).
- [ ] Port 22/tcp — currently `Anywhere`; tighten to the admin CIDR (Salk VPN + admin IPs) once confirmed
- [x] No other ports open (`sudo ufw status verbose` verified)

**Outbound HTTPS — verified 2026-04-20**
- [x] `api.github.com` → HTTP/2 200
- [ ] `github.com` — (run to confirm)
- [x] `hub.docker.com` → HTTP/2 200
- [x] `registry-1.docker.io/v2/` → HTTP/2 401 (expected — auth required, reachable)

**Public IP note**
bloom-dev has a public IP (`198.202.68.43`). Salk IT's upstream firewall may already block external traffic, but UFW was permissive on 22/443 at check time.

### Self-hosted GitHub Actions runner

Pin version: **v2.3331.0** (matches what the deploy workflow expects).

- [x] On the Salk server, as `bloom-deploy` (NOT root):
  - [x] Get a registration token: Repo Settings → Actions → Runners → **New self-hosted runner** → copy the token (valid ~1 hour).
  - [x] Register with the exact labels the workflow expects:
- [x] Verify in Repo Settings → Actions → Runners: `bloom-prod-runner` shows status **Idle** with labels `self-hosted, linux, salk-network`.
- [x] Verify the runner process is running as `bloom-deploy`, NOT root: `ps -o user= -p $(pgrep -f Runner.Listener)` → `bloom-deploy`.

### Supabase CLI (for the migration runner)

Pin version: **v2.92.1** (must match the CI runner install in `pr-checks.yml`).

- [x] On the Salk server, as `bloom-deploy`:

  ```bash
  SUPABASE_VERSION=2.92.1
  curl -LO https://github.com/supabase/cli/releases/download/v${SUPABASE_VERSION}/supabase_${SUPABASE_VERSION}_linux_amd64.deb
  sudo apt install -y ./supabase_${SUPABASE_VERSION}_linux_amd64.deb
  ```

- [x] Verify: `supabase --version | grep 2.92.1` passes.
- [x] **Port-collision check** before first `docker compose up`: `sudo ss -tlnp | grep -E ':5432|:5433'` returns empty. If anything is already listening on those ports (e.g. a native Postgres install), remove or relocate it before proceeding — the compose port mapping will otherwise fail to bind.
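The port-collision check above can be wrapped in a small helper so it's copy-pasteable into a server preflight script. A sketch, nothing Bloom-specific: it just filters `ss -tlnp` output for the two Postgres ports; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: fail if anything is already listening on the Supabase Postgres
# ports before the first `docker compose up`. Reads `ss -tlnp` output on stdin.
check_pg_ports_free() {
  if grep -Eq ':(5432|5433)[[:space:]]'; then
    echo "port collision: something is already listening on 5432/5433" >&2
    return 1
  fi
  return 0
}

# On the server:
#   sudo ss -tlnp | check_pg_ports_free && echo "ports free"
```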
### NTP / time sync (critical for JWTs and TLS)

- [ ] `timedatectl` shows `System clock synchronized: yes`
  - [ ] If not: `sudo systemctl enable --now systemd-timesyncd`
- [ ] JWTs and TLS certs both validate timestamps — clock drift >30s breaks login and cert issuance silently

### Docker log rotation (prevents disk fill)

- [x] `/etc/docker/daemon.json` configured with log rotation:

  ```json
  {
    "log-driver": "json-file",
    "log-opts": { "max-size": "10m", "max-file": "3" }
  }
  ```

- [x] `sudo systemctl restart docker` after config change

### Quick resource check

- [x] `df -h` — 50+ GB free (using `/data`)
- [x] `free -h` — 16 GB RAM, sufficient for ~15 containers

---

## Phase 3 — DNS + TLS

### TLS certificates (done in PR #145)

Using **Caddy's internal CA (`tls internal`)** for the first deploy. Certs are issued locally by Caddy, stored in the `caddy-data` volume, auto-renewing. Browsers show a one-time "not secure" warning because the CA isn't publicly trusted. Acceptable for a Salk-internal-only service.

- [x] Caddyfile uses `{$CADDY_TLS_DIRECTIVE}` → `tls internal` in the prod env (PR #145)
- [ ] PR #145 merged

### DNS records

**Current state:** the parent `bloom-dev.salk.edu` A record is **already live in Salk DNS**. All that's missing are the **5 subdomain records** below. No change to the parent is needed.

**Domain naming** (matches `scripts/generate-env-config.sh`):

| Env | Main | Studio | MinIO |
|---|---|---|---|
| **Prod** | `bloom-dev.salk.edu` (exists) | `studio.bloom-dev.salk.edu` | `minio.bloom-dev.salk.edu` |
| **Staging** | `staging-bloom-dev.salk.edu` | `staging-studio.bloom-dev.salk.edu` | `staging-minio.bloom-dev.salk.edu` |

**Approach:** the first deployment is a soft launch. Until Salk IT adds the 5 missing subdomain records, ops laptops use `/etc/hosts` to map the subdomains → IP.
#### For first deployment (ops laptops only — 5 missing subdomains)

Add to `/etc/hosts` on each ops person's laptop (no entry for `bloom-dev.salk.edu` — that one resolves already):

```
# Prod subdomains (parent bloom-dev.salk.edu already resolves via Salk DNS)
198.202.68.43 studio.bloom-dev.salk.edu
198.202.68.43 minio.bloom-dev.salk.edu
# Staging (all three subdomains)
198.202.68.43 staging-bloom-dev.salk.edu
198.202.68.43 staging-studio.bloom-dev.salk.edu
198.202.68.43 staging-minio.bloom-dev.salk.edu
```

- [x] Ops laptop(s) — 5 /etc/hosts entries added
- [ ] `dig +short studio.bloom-dev.salk.edu` returns `198.202.68.43` on an ops laptop

#### For user rollout (Salk IT — async)

Ask Salk IT to add these 5 internal A records, all resolving to `198.202.68.43`:

Prod (2 new subdomains):
- `studio.bloom-dev.salk.edu`
- `minio.bloom-dev.salk.edu`

Staging (3 new subdomains):
- `staging-bloom-dev.salk.edu`
- `staging-studio.bloom-dev.salk.edu`
- `staging-minio.bloom-dev.salk.edu`

All should resolve only in the Salk internal DNS zone (NXDOMAIN from outside Salk). Split-horizon DNS is fine. Parent `bloom-dev.salk.edu` unchanged.

### Follow-up: request DNS records from Salk IT

- [ ] Email sent to Salk IT requesting the 5 records
- [ ] Records created by Salk IT
- [ ] `dig bloom-dev.salk.edu` from a Salk laptop (no /etc/hosts override) → `198.202.68.43`
- [ ] `dig studio.bloom-dev.salk.edu` → `198.202.68.43`
- [ ] `dig bloom-dev.salk.edu` from outside Salk → NXDOMAIN (internal-only confirmed)
- [ ] /etc/hosts entries removed from ops laptops (cleanup once DNS is live)

### Follow-up: get a real TLS cert to replace `tls internal`

Currently Caddy uses a self-signed cert that causes a "not secure" warning in browsers. After deployment, swap to a real cert so users see a green padlock.

#### Option 1 — Let's Encrypt (preferred, fully automatic)

If Salk IT gives us a DNS API key, Caddy uses it to prove we own the domain to Let's Encrypt.
LE then issues the cert, and Caddy repeats this every ~60 days on its own.

- [ ] Ask Salk IT for a DNS API key for `salk.edu` (scoped to write TXT records for `bloom-dev.salk.edu`, `studio.bloom-dev.salk.edu`, `minio.bloom-dev.salk.edu`, and the staging variants `staging-bloom-dev.salk.edu`, `staging-studio.bloom-dev.salk.edu`, `staging-minio.bloom-dev.salk.edu`)
- [ ] Ask which DNS provider Salk uses (Cloudflare, Route53, etc.) so we use the right Caddy plugin
- [ ] Save the key as `PROD_DNS_API_TOKEN` in GitHub Secrets
- [ ] Rebuild the Caddy image with the DNS plugin baked in
- [ ] Update the `CADDY_TLS_DIRECTIVE` secret from `tls internal` to `tls { dns <provider> {env.DNS_API_TOKEN} }`
- [ ] Redeploy. Check the cert issuer is "Let's Encrypt".

#### Option 2 — Salk IT issues a cert file (fallback, manual renewal)

If Salk IT can't give us a DNS API key, they can hand us a cert file signed by Salk's own CA. Same result (trusted cert, green padlock), but we have to ask them again every 1-2 years when it expires.

- [ ] Ask Salk IT for cert + key files covering:
  - prod: `bloom-dev.salk.edu`, `studio.bloom-dev.salk.edu`, `minio.bloom-dev.salk.edu`
  - staging: `staging-bloom-dev.salk.edu`, `staging-studio.bloom-dev.salk.edu`, `staging-minio.bloom-dev.salk.edu`

  Easiest is a SAN cert covering all six. A wildcard `*.bloom-dev.salk.edu` would also work for most (it doesn't cover `bloom-dev.salk.edu` itself — that's a separate SAN).
- [ ] Put the files on the server at `/data/bloom/caddy-certs/`, owned by `bloom-deploy`
- [ ] Mount that directory into the Caddy container (`docker-compose.prod.yml`)
- [ ] Update `CADDY_TLS_DIRECTIVE` to `tls /etc/caddy/certs/bloom-dev.salk.edu.crt /etc/caddy/certs/bloom-dev.salk.edu.key`
- [ ] Redeploy. Check the cert issuer is Salk's CA.
- [ ] Calendar reminder 2 weeks before the cert expires to ask Salk IT for a renewal.
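For the renewal reminder, a small helper can turn `openssl x509 -noout -enddate` output into days remaining, so the check can run from cron instead of a calendar. A sketch assuming GNU `date`; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: days until a cert expires, given openssl's enddate line, e.g.
#   openssl s_client -connect bloom-dev.salk.edu:443 </dev/null 2>/dev/null \
#     | openssl x509 -noout -enddate
# prints something like: notAfter=Jun  1 12:00:00 2027 GMT
days_until_expiry() {
  local enddate="${1#notAfter=}"                # strip the openssl prefix
  local end_s now_s
  end_s=$(date -d "$enddate" +%s) || return 1   # GNU date parses this format
  now_s=$(date +%s)
  echo $(( (end_s - now_s) / 86400 ))
}

# Example: warn when under the 2-week reminder window
#   [ "$(days_until_expiry "$enddate_line")" -lt 14 ] && echo "renew now"
```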
### Optional: distribute the Caddy root cert to team devices

Eliminates browser warnings without waiting for a real cert. One-time per device.

- [ ] Export the root cert after the first deploy:

  ```bash
  docker compose exec caddy cat /data/caddy/pki/authorities/local/root.crt > caddy-root.crt
  ```

---

## Phase 4 — Merge deploy-related PRs (in order)

Phase 5+ can't happen until `deploy.yml` is fully wired. The sequence below is coordinated with reviewers. All PRs include 4 small follow-up fixes from the adversarial self-review (`--debug` on the Layer C migration summary, password masking in workflow logs, `cp -p` on the env snapshot, `timeout-minutes: 30` on deploy jobs).

### Step 1 — CI-only safety nets (independent, no deploy.yml conflicts)

- [ ] Merge [#147](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/147) → `main` — fix CI MinIO write probe (was silently failing as a warning)
- [ ] Merge [#148](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/148) → `main` — stronger `verify-env-parity` (catches cross-env leaks + RHS drift + composite-value bugs)

### Step 2 — Deploy hardening chain (order-sensitive, touches the deploy.yml rollback block)

Rebase each on the one before it. After each rebase, `git diff` the rollback block and confirm all previously-merged PRs' changes are intact (rebases can silently drop changes — verify by eye).
- [ ] Merge [#139](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/139) → `main` — rollback safety: `git reset --hard`, `exit 1` on missing `.previous_sha`, `.secrets-*.txt` gitignore
- [ ] Merge [#142](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/142) → `main` — docker-native healthchecks + `--wait` + concurrency guard + smoke test
- [ ] Merge [#143](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/143) → `main` — `.env.prod` snapshot + restore on rollback, `umask 077` before heredocs, `cp -p` preserves 0600

### Step 3 — Feature PRs that depend on Step 2 structure

- [ ] Merge [#146](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/146) → `main` — Supabase CLI migration runner (deploy + CI), `--debug` workaround for cli#4839, password masking
- [ ] Merge [#145](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/145) → `main` — Caddy `tls internal` via the `CADDY_TLS_DIRECTIVE` env var (CI stays HTTP-only, prod gets local-CA HTTPS)

### Step 4 — Populate GitHub Secrets

AFTER Step 3 and BEFORE Step 5. Until the secrets exist, any deploy trigger fails harmlessly at the SSH step — a deliberate safety net during the merge chain.
- [ ] Deploy-access: `DEPLOY_HOST`, `DEPLOY_USER`, `DEPLOY_SSH_KEY`, `DEPLOY_HOST_KEY`
- [ ] ~45 `PROD_*` config values via `scripts/generate-env-config.sh prod`
- [ ] 9 `PROD_*` cryptographic secrets via `scripts/generate-secrets.sh prod`
- [ ] Same two sets with the `STAGING_*` prefix
- [ ] `PROD_OPENAI_API_KEY` — real key (not `sk-placeholder`)

### Step 5 — Staging branch + self-hosted runner targeting

- [ ] Create the `staging` branch from current `main`:

  ```bash
  git fetch origin
  git checkout -b staging origin/main
  git push -u origin staging
  ```

- [ ] Change [#144](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/144)'s base branch from `main` → `staging` in the GitHub UI
- [ ] Verify `bloom-prod-runner` still shows **Idle** in Repo Settings → Actions → Runners
- [ ] Merge [#144](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/144) → `staging` — switches `runs-on:` to `[self-hosted, linux, salk-network]` and adds `timeout-minutes: 30` to both jobs. This triggers the **first staging deploy** automatically.

### Step 6 — Verify staging, rollback rehearsal

See Phase 5 and Phase 6 below.

### Step 7 — First prod deploy

- [ ] Merge `staging` → `main` — triggers the **first production deploy**
- [ ] Manual approval in GitHub Actions (the required-reviewer gate fires via the `production` environment)
- [ ] Monitor: migration summary, container health, smoke test
- [ ] Rollback step present but do NOT trigger it manually

### Why this order

- **Steps 1-3 are independent of secrets.** PRs land on `main` without real deploys firing — the deploy workflow triggers on every push but fails harmlessly at "Set up SSH" until secrets exist.
- **Step 4 (secrets) gates real deploys.** Once the secrets are in, any push to `main` would deploy. So secrets land right before we're ready.
- **Step 5 (staging first).** #144 merges to `staging` (not `main`), triggering a validation deploy against the `staging` environment.
  Only after staging verifies do we merge `staging` → `main` for the real prod deploy.
- **#144 merges last.** It's the one that "arms" the deploy pipeline. Merging it earlier would queue 24h-failing deploys (pre-runner) or fail at SSH (pre-secrets).

---

## Phase 5 — Staging deploy (workflow_dispatch)

- [ ] Trigger the staging deploy via Actions UI → `Deploy` workflow → Run workflow → environment: staging
- [ ] Workflow routes to the self-hosted runner (verify "This job runs on self-hosted" in the UI)
- [ ] All 15 containers reach a healthy state
- [ ] `supabase db push` applies all 158 historical migrations to the fresh staging DB
- [ ] Migration summary appears at the top of the run with every migration showing both `Local` and `Remote` columns populated
- [ ] Storage buckets registered (`SELECT * FROM storage.buckets` returns 7 rows)
- [ ] Login + signup + email-verify flow works (or email-auto-confirm if SMTP not configured)
- [ ] LangChain agent responds to a test query
- [ ] File upload to storage succeeds via the UI

---

## Phase 6 — Rollback rehearsal on staging

**Deliberately break a staging deploy and verify auto-rollback works.** Do NOT trust the rollback cold against production.

- [ ] Create a branch with a deliberate break (choose one):
  - Point `bloommcp` at a nonexistent image tag
  - Add an obviously-broken SQL migration (e.g.
    `INVALID SQL;`)
  - Set `POSTGRES_HOST_PORT` to a colliding port
- [ ] Trigger a staging deploy
- [ ] Verify:
  - Deploy fails loudly with a red X + `::error::` annotation at the top of the run summary
  - Diagnostic step runs and shows applied-vs-pending migrations
  - `Rollback on failure` step fires and reverts to the previous good SHA
  - Staging stack returns to the previous known-good state
  - Auto-generated GitHub Actions email notifies the team

---

## Phase 7 — Production deploy

- [ ] All CI checks green on `main`
- [ ] At least 1 reviewer approved the PR being deployed
- [ ] (Recommended) Manual `pg_dump -Fc` snapshot of the prod DB before the first deploy, as cheap insurance:

  ```bash
  ssh bloom-deploy@<host> "
    cd /data/bloom/production
    docker compose exec -T db-prod pg_dump -Fc -U supabase_admin -d postgres > ~/pre-first-deploy-$(date +%Y%m%d).pgdump
    ls -lh ~/pre-first-deploy-*.pgdump
  "
  ```

  Keep it until the first deploy is confirmed successful. Recovery path: `pg_restore --clean -U supabase_admin -d postgres < ~/pre-first-deploy-*.pgdump`.
- [ ] Push to `main` (or `workflow_dispatch` → production)
- [ ] Approve the production deploy in GitHub Actions (the required-reviewer gate fires)
- [ ] Monitor in real time: migration summary updates; container health check; smoke test
- [ ] The rollback step is present in the workflow but DO NOT trigger it manually

---

## Phase 8 — Post-deploy verification

From a **Salk network host** (laptop on campus / VPN):
- [ ] `https://bloom-dev.salk.edu` loads — accept the `tls internal` one-time warning, verify the site renders
- [ ] Studio accessible at `https://studio.bloom-dev.salk.edu`
- [ ] Staging accessible at `https://staging-bloom-dev.salk.edu:8443`
- [ ] Sign up → email verify → sign in flow completes end-to-end
- [ ] LangChain agent responds to a test query
- [ ] File upload via the UI succeeds → plot appears correctly in `/app/phenotypes` (tests storage + RLS, not just auth)
- [ ] All containers healthy (`docker compose ps`)
- [ ] Database connections working (Supabase login → auth returns a JWT)
- [ ] MinIO writable (upload a file via Studio → confirm it lands in the bucket)

From **outside the Salk network** (home laptop, non-VPN cellular):
- [ ] `https://bloom-dev.salk.edu` fails to resolve OR the connection times out (confirms the Salk-internal access control)
- [ ] `nc -zv <public-host-or-ip> 443` fails

---

## Ongoing — basic operational hygiene (before routine deploys)

### Monitoring / alerting (minimum viable)

- [ ] Slack webhook wired to the GitHub Actions "deploy failed" event, or an email distribution that includes the deploy operator. (Full observability is tracked in #111; this is the Monday-minimum.)
### Incident response contacts

- [ ] Document (in this issue or in `docs/runbooks/incident.md`) who gets called if `bloom-dev.salk.edu` is down Sunday at 2am:
  - Primary: <name + contact>
  - Fallback: <name + contact>
  - Salk IT after-hours number: <number>

### Deploy-timing policy

- [ ] Documented convention: no Friday-afternoon deploys, no pre-holiday deploys, no deploys during known busy hours. Outlasts the first deploy and sticks as team convention.

### JWT + key rotation reminder

- [ ] Calendar reminder — rotate JWT keys every 5 years (run `scripts/generate-secrets.sh prod`, update GitHub Secrets, redeploy)
- [ ] Calendar reminder — `DB_ENC_KEY` is effectively non-rotatable without a data-migration plan (encrypted columns would need re-encryption). Flag for the #109 automated-backup work.

---

## Manual rollback procedure (if auto-rollback fails)

Keep this documented somewhere accessible (this issue or `docs/runbooks/rollback.md`):

```bash
ssh bloom-deploy@<DEPLOY_HOST>
cd /data/bloom/production
PREV=$(cat .previous_sha)
echo "Rolling back to $PREV"
git reset --hard $PREV
docker compose -f docker-compose.prod.yml --env-file .env.prod up -d --build --remove-orphans
```

If `.previous_sha` is missing or wrong, use `git log --oneline` to find the last known-good commit.

**If a migration partially applied and is the cause of failure**, the operator must manually fix the broken migration and redeploy — the Supabase CLI will skip already-applied ones. The partial-apply contract + RLS-specific mitigation is tracked in the rollback-automation follow-up proposal.

---

## Risk disclosures to acknowledge before first deploy

- **#134 (destructive-SQL CI scan)** is still open. The CI lint in the migration PR catches filename + timestamp issues but NOT destructive SQL (`DROP TABLE`, `TRUNCATE`, etc.).
  **Promote #134 before the SECOND prod deploy.**

---

## Follow-up issues (post-deployment)

- [ ] **Follow-up — audit root disk usage** (not a deploy blocker): `/` at 81% used (75 GB of 98 GB) after the Docker migration. The remaining content is NOT Docker — likely old repo clones, package cache, or `/var/log/` accumulation. File as `chore(server): audit and reclaim space on bloom-dev root filesystem`. Run `sudo du -h -d 1 / | sort -rh | head` on the server to identify the biggest offenders.
- Migration rollback automation + safety classifications + audit trail (Proposal 2)
- Local dev: migrate `public._migrations` → `supabase_migrations.schema_migrations` tracking
- Backfill rollback files for 156 historical migrations
- Automated backups for Postgres + MinIO (#109) — promote from follow-up to deploy-blocker
- Destructive-SQL CI scan (#134) — promote from follow-up to deploy-blocker
- Monitoring / resource limits / log rotation (#111)
- GHCR image publishing + SHA-pinned deploys (#107)
- SSH heredoc refactor (#137) — now scoped over 16+ SSH blocks (10 existing + 6 new from the migration PR)
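Until #134 lands, the destructive-SQL gap flagged in the risk disclosures can be covered by hand with a crude grep over each new migration file. A sketch only, not the eventual CI implementation: the function name is illustrative and the pattern list is deliberately incomplete (it misses `DELETE`, `ALTER ... DROP CONSTRAINT`, etc.).

```shell
#!/usr/bin/env bash
# Sketch: flag obviously destructive statements in a migration file.
# Prints the offending lines and returns 1 if any are found.
scan_migration() {
  if grep -Ein '\b(drop[[:space:]]+(table|column)|truncate)\b' "$1"; then
    echo "destructive SQL found in $1 — needs explicit review" >&2
    return 1
  fi
  return 0
}

# Usage before approving a migration PR:
#   scan_migration supabase/migrations/20260420_new_feature.sql
```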