Scouttlo
GitHub · B2B · devtools

SaaS platform that automates the creation and management of deployment checklists, GitHub Actions configurations, branch protection rules, and environments, with predefined templates and automatic validation.

Detected 5 hours ago

7.3 / 10
Overall score

Turn this signal into an advantage

We help you build it, validate it, and get there first.

We go from idea to plan: who buys, what MVP to launch, how to validate it, and what to measure before investing months.

Extra context

We explain what the opportunity really means, what problem exists today, how this idea would solve it, and the key concepts behind it.


Score breakdown

Urgency: 8.0
Market size: 7.0
Feasibility: 8.0
Competition: 6.0
Pain point

Development teams have to manage complex deployment checklists and CI/CD configurations by hand, a manual and error-prone process.

Who would pay for this

Development teams, DevOps engineers, and CTOs at startups and mid-sized companies that deploy web applications.

Origin signal

"Pre-deployment checklist — before first production deploy PR"

Original post

Pre-deployment checklist — before first production deploy PR

Repository: Salk-Harnessing-Plants-Initiative/bloom
Author: blm3886

# Pre-deployment checklist — Bloom v2 first production deploy

Everything that must be verified / configured before the first production deploy.

- **GitHub Actions → server**: self-hosted runner installed on the Salk server.
- **TLS certs**: `tls internal` (Caddy generates its own CA locally, auto-renewing) for first deploy; swap for a real cert from Salk IT once obtained.

---

## Operating policy

**Migrations and deploys to prod and staging are applied EXCLUSIVELY through the GitHub Actions deploy workflow.**

Do NOT manually run `supabase db push`, `docker compose up`, or any deploy operation over SSH on the server except for a documented emergency recovery scenario. Any emergency manual operation must be logged on an incident ticket so the state difference between the workflow's expectation and reality is visible after the fact.

**Why**: every code change that reaches prod should have a git SHA, PR review, audit trail, and rollback path. Manual operations bypass all four.
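The "log every manual operation" policy can be enforced mechanically on the server. A minimal sketch — `require_incident_ticket`, `INCIDENT_TICKET`, and `AUDIT_LOG` are illustrative names, not anything that exists in the repo:

```bash
#!/usr/bin/env bash
# Hypothetical guard: source this in any manual server-side script so emergency
# operations refuse to run unless an incident ticket is recorded first.
set -euo pipefail

require_incident_ticket() {
  if [ -z "${INCIDENT_TICKET:-}" ]; then
    echo "refusing manual deploy: set INCIDENT_TICKET=<issue-url> first" >&2
    return 1
  fi
  # Append an audit line so workflow-vs-reality drift is reconstructable later.
  echo "$(date -u +%FT%TZ) $(whoami) ticket=${INCIDENT_TICKET} cmd=$*" \
    >> "${AUDIT_LOG:-/tmp/manual-ops.log}"
}

# Example: require_incident_ticket docker compose up -d
```

This keeps the emergency escape hatch open while making the incident-ticket requirement a hard precondition rather than a convention.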
---

## Phase 1 — GitHub repository configuration

### Environments (Settings → Environments)

- [x] `production` environment created:
  - [x] Required reviewers: at least 1 (Settings → Environments → production → Deployment protection rules → Required reviewers → Add reviewer)
  - [x] Deployment branch rule: `main` only
  - [x] Wait timer: 5 minutes (recommended)
- [x] `staging` environment created:
  - [x] Deployment branch rule: `staging` only
  - [x] No required reviewers (staging is intended for iteration)

### Branch protection on `main` (Settings → Branches → Add branch protection rule)

**Branch name pattern:**

- [x] Pattern set to `main`

**Require a pull request before merging:**

- [x] Main checkbox ticked
- [x] Required approving reviews: **1**
- [x] Dismiss stale pull request approvals when new commits are pushed — ticked (recommended)

**Require status checks to pass before merging:**

- [x] Main checkbox ticked
- [x] Require branches to be up to date before merging — ticked
- [x] Following status checks added to the required list.
Search by **display name** (what appears in the Actions UI):

- [x] `Build & npm CVE Audit` (internal job id: `build-and-audit`)
- [x] `Verify prod/staging env parity` (`verify-env-parity`)
- [ ] `Validate committed env defaults` (`validate-env-defaults`) — added in PR #144
- [x] `Python Security Audit for CVEs` (`python-audit`)
- [x] `Analyze (actions)` (CodeQL)
- [x] `Analyze (javascript-typescript)` (CodeQL)
- [x] `Analyze (python)` (CodeQL)

**Deferred status checks — add AFTER their blocker merges:**

- [ ] `Docker Compose Health Check` (`compose-health-check`) — add after PR #152 merges (removes `continue-on-error: true` so failures actually block merge)
- [ ] `lint-migrations` — add after PR #146 merges (that PR introduces the job; requiring it now would permanently block all PRs)

**Additional protections:**

- [ ] Require conversation resolution before merging — ticked (recommended)
- [ ] Require signed commits — optional (nice-to-have for audit trail)
- [ ] Require linear history — optional (if your team prefers rebase-merge workflows)
- [ ] Include administrators — ticked (so `main` is protected even from org admins)
- [ ] Restrict who can push to matching branches — ticked, bypass list **empty** (nobody can push directly; everyone must PR)

**Destructive-action protections (both MUST be unticked):**

- [x] Allow force pushes — **UNTICKED** (force-pushing `main` would rewrite history; destructive)
- [x] Allow deletions — **UNTICKED** (deleting `main` would be catastrophic)

**After clicking Create / Save changes:**

- [x] Confirm the Branch protection rules page now shows one row with pattern `main` and "✓ Active" status

### Secrets (Settings → Secrets and variables → Actions)

**Updated after PR #144 (env-config refactor).** Non-sensitive config (domains, ports, URLs, flags) now lives in committed `.env.prod.defaults` and `.env.staging.defaults` files, no longer in GitHub Secrets. Only true credentials and infrastructure paths are kept here.
Current state: **46 secrets in store (cap 100), 1 remaining to add** (`DEPLOY_HOST_KEY`).

**Deploy access (4 total, 3 done, 1 gate):**

- [x] `DEPLOY_HOST` — server address
- [x] `DEPLOY_USER` — SSH user (`bloom-deploy`)
- [x] **`DEPLOY_SSH_KEY`** — private SSH key for the deploy user (note: the workflow reads `secrets.DEPLOY_SSH_KEY`, not `DEPLOY_KEY`)
- [ ] **`DEPLOY_HOST_KEY`** — server's SSH host public key. **This is the deploy arm-switch.** Adding it completes the chain that lets the next push-to-main or `workflow_dispatch` run a real deploy. Add only when ready. Capture via `ssh-keyscan -t ed25519 <DEPLOY_HOST>` from a trusted machine (not the server itself); verify the fingerprint out-of-band with the server admin; then `gh secret set DEPLOY_HOST_KEY`.

**Deploy paths (4 total — all done):**

- [x] `PROD_DEPLOY_PATH` — `/data/bloom/production` (moved off `/opt`; root fs was 84% full)
- [x] `STAGING_DEPLOY_PATH` — `/data/bloom/staging`
- [x] `PROD_MINIO_DATA_PATH` — `/data/bloom/minio-data`
- [x] `STAGING_MINIO_DATA_PATH` — `/data/bloom/minio-staging`

**Production credentials (15 total — all done; values from `scripts/generate-secrets.sh prod`):**

- [x] `PROD_POSTGRES_PASSWORD`
- [x] `PROD_JWT_SECRET`, `PROD_ANON_KEY`, `PROD_SERVICE_ROLE_KEY`
- [x] `PROD_DB_ENC_KEY` — **must be exactly 16 ASCII chars** (AES-128 key size). `openssl rand -hex 8` → 16 hex chars = 16 ASCII bytes. Verify: `echo -n "$PROD_DB_ENC_KEY" | wc -c` returns 16. If it's 32, Realtime crashes ~90s into startup with a cryptic Elixir error.
- [x] `PROD_VAULT_ENC_KEY`, `PROD_SUPAVISOR_ENC_KEY`, `PROD_SECRET_KEY_BASE`
- [x] `PROD_MINIO_ROOT_USER` = `supabase`, `PROD_MINIO_PASSWORD` (admin credentials — username moved from defaults to secrets 2026-04-20; rotation to unique per-env value tracked in #151)
- [x] `PROD_DASHBOARD_USERNAME` = `admin`, `PROD_DASHBOARD_PASSWORD` (Studio admin basic-auth — same move, same rotation follow-up)
- [x] `PROD_BLOOMMCP_API_KEY`
- [x] `PROD_OPENAI_API_KEY` — **real OpenAI key required.** If unset, agent chat calls fail (stack still deploys healthy). Test after deploy: `curl /api/langchain/chat`.
- [x] `PROD_LANGCHAIN_API_KEY` — LangSmith tracing. Agent works without it (tracing no-ops); set for observability.

**Staging credentials (15 total — 13 done, 2 placeholders):**

- [x] Same set as prod with `STAGING_` prefix
- [ ] `STAGING_OPENAI_API_KEY` — currently `needs-to-be-updated`. Replace with real key (or reuse prod's) when ready.
- [ ] `STAGING_LANGCHAIN_API_KEY` — currently `needs-to-be-updated`. Replace when LangSmith project is set up for staging.

**CI credentials (9 total — all done, used by `compose-health-check` in `pr-checks.yml`):**

- [x] `CI_POSTGRES_PASSWORD`, `CI_JWT_SECRET`, `CI_ANON_KEY`, `CI_SERVICE_ROLE_KEY`
- [x] `CI_DB_ENC_KEY`, `CI_VAULT_ENC_KEY`, `CI_SUPAVISOR_ENC_KEY`, `CI_SECRET_KEY_BASE`
- [x] `CI_MINIO_PASSWORD`

**Non-sensitive config (NOT in GitHub Secrets):**

- All domains, URLs, ports, feature flags, pooler sizes, Postgres USER/HOST/PORT/DB, etc. live in `.env.prod.defaults` and `.env.staging.defaults` (committed to the repo). The deploy workflow concatenates defaults + secrets into `.env.{prod,staging}` on the runner at deploy time. Values are unit-tested against docker-compose.prod.yml references (see `tests/unit/test_env_defaults.py`) so CI catches drift.
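The defaults + secrets concatenation described above can be sketched as follows. `assemble_env` is an illustrative helper, not a script in the repo; it also folds in the `DB_ENC_KEY` length guard from the credentials checklist:

```bash
#!/usr/bin/env bash
# Sketch: committed defaults first, then secret values, written owner-only.
set -euo pipefail

assemble_env() {
  local defaults="$1" secrets="$2" out="$3"
  umask 077                       # resulting .env file is created 0600
  cat "$defaults" "$secrets" > "$out"
  # Guard from the checklist: DB_ENC_KEY must be exactly 16 ASCII chars (AES-128).
  local key
  key=$(grep -E '^PROD_DB_ENC_KEY=' "$out" | cut -d= -f2)
  if [ "${#key}" -ne 16 ]; then
    echo "PROD_DB_ENC_KEY is ${#key} chars, expected 16" >&2
    return 1
  fi
}

# Example: assemble_env .env.prod.defaults .secrets-prod.txt .env.prod
```

Running the length check at assembly time surfaces the 32-char mistake immediately, instead of as a Realtime crash ~90s after deploy.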
**SMTP credentials for GoTrue (optional — only if you want email-verify flow):**

- [ ] `PROD_SMTP_HOST`, `PROD_SMTP_PORT`, `PROD_SMTP_USER`, `PROD_SMTP_PASS` — set only if `ENABLE_EMAIL_AUTOCONFIRM=false`. If `true` (the current default in `.env.prod.defaults`), skip — but then email-verify signup is effectively disabled.

---

## Phase 2 — Server setup

### Base OS + Docker

- [x] Ubuntu 22.04+ (24.04.4 LTS in place) with Docker + docker-compose v2 installed
- [x] `bloom-deploy` user created, added to the `docker` group, **NOT root**
- [x] Deploy-user SSH public key added to `~bloom-deploy/.ssh/authorized_keys`
- [x] SSH-to-self works for `bloom-deploy@localhost` (self-hosted runner does loopback SSH to target):
  - [x] Local verify: `ssh bloom-deploy@localhost "whoami"` → `bloom-deploy`
  - [ ] Host key captured: `ssh-keyscan -t ed25519 localhost` → saved as `DEPLOY_HOST_KEY` secret
- [x] ~~/opt/bloom/production/~~ `/data/bloom/production/` created, owned by `bloom-deploy`, initial `git clone` done (moved to `/data` because root fs is 84% full)
- [x] ~~/opt/bloom/staging/~~ `/data/bloom/staging/` created, owned by `bloom-deploy`, initial `git clone` done
- [x] MinIO **prod** data path `/data/bloom/minio-data` created, owned by `bloom-deploy`, writable (verified with `touch` + `rm`)
- [x] MinIO **staging** data path `/data/bloom/minio-staging` created + verified writable

### Firewall

**Inbound rules (UFW — our control)**

- [x] Port 80/tcp — restricted to `10.0.0.0/8` (Salk-internal)
- [x] ~~Port 443/tcp — currently `Anywhere`; tighten to match 80 once Salk IT confirms CIDR (pending Elizabeth/Salk IT)~~ → Done 2026-04-20. Fernando (Salk IT) approved `10.0.0.0/8` for 22 and 443. UFW rule changed: `ALLOW IN Anywhere` → `ALLOW IN 10.0.0.0/8` (both v4 and v6 wide-open rules removed).
- [ ] Port 22/tcp — currently `Anywhere`; tighten to admin CIDR (Salk VPN + admin IPs) once confirmed
- [x] No other ports open (`sudo ufw status verbose` verified)

**Outbound HTTPS — verified 2026-04-20**

- [x] `api.github.com` → HTTP/2 200
- [x] `github.com` — (run to confirm)
- [x] `hub.docker.com` → HTTP/2 200
- [x] `registry-1.docker.io/v2/` → HTTP/2 401 (expected — auth required, reachable)

**Public IP note**

bloom-dev has a public IP (`198.202.68.43`). Salk IT's upstream firewall may already block external traffic, but UFW was permissive on 22/443 at check time.

### Self-hosted GitHub Actions runner

Pin version: **v2.3331.0** (matches what the deploy workflow expects).

- [x] On the Salk server, as `bloom-deploy` (NOT root):
  - [x] Get a registration token: Repo Settings → Actions → Runners → **New self-hosted runner** → copy the token (valid ~1 hour).
  - [x] Register with the exact labels the workflow expects:
- [x] Verify in Repo Settings → Actions → Runners: `bloom-prod-runner` shows status **Idle** with labels `self-hosted, linux, salk-network`.
- [x] Verify the runner process is running as `bloom-deploy`, NOT root: `ps -o user= -p $(pgrep -f Runner.Listener)` → `bloom-deploy`.

### Supabase CLI (for the migration runner)

Pin version: **v2.92.1** (must match the CI runner install in `pr-checks.yml`).

- [x] On the Salk server, as `bloom-deploy`:

  ```bash
  SUPABASE_VERSION=2.92.1
  curl -LO https://github.com/supabase/cli/releases/download/v${SUPABASE_VERSION}/supabase_${SUPABASE_VERSION}_linux_amd64.deb
  sudo apt install -y ./supabase_${SUPABASE_VERSION}_linux_amd64.deb
  ```

- [x] Verify: `supabase --version | grep 2.92.1` passes.
- [x] **Port-collision check** before first `docker compose up`: `sudo ss -tlnp | grep -E ':5432|:5433'` returns empty. If anything is already listening on those ports (e.g. a native Postgres install), remove or relocate before proceeding — the compose port mapping will fail to bind otherwise.
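As an aside, the port-collision check can also be done without parsing `ss` output, using bash's `/dev/tcp` pseudo-device (bash-specific; `port_free` is an illustrative helper, not part of the repo):

```bash
#!/usr/bin/env bash
# Try to connect to each port on loopback: a successful connect means something
# is already listening and the compose port mapping would fail to bind.
set -uo pipefail

port_free() {
  # Subshell opens fd 3 to the port; connection-refused → nonzero → negated.
  ! (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

for port in 5432 5433; do
  if port_free "$port"; then
    echo "port $port free"
  else
    echo "port $port already in use — resolve before 'docker compose up'" >&2
  fi
done
```

This is handy on minimal hosts where `ss` is absent, though `ss -tlnp` remains the better tool when available because it also names the owning process.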
### NTP / time sync (critical for JWTs and TLS)

- [ ] `timedatectl` shows `System clock synchronized: yes`
- [ ] If not: `sudo systemctl enable --now systemd-timesyncd`
- [ ] JWTs and TLS certs both validate timestamps — clock drift >30s breaks login and cert issuance silently

### Docker log rotation (prevents disk fill)

- [x] `/etc/docker/daemon.json` configured with log rotation:

  ```json
  {
    "log-driver": "json-file",
    "log-opts": { "max-size": "10m", "max-file": "3" }
  }
  ```

- [x] `sudo systemctl restart docker` after config change

### Quick resource check

- [x] `df -h` — 50+ GB free — using `/data`
- [x] `free -h` — 16 GB RAM sufficient for ~15 containers

---

## Phase 3 — DNS + TLS

### TLS certificates (done in PR #145)

Using **Caddy's internal CA (`tls internal`)** for the first deploy. Certs issued locally by Caddy, stored in the `caddy-data` volume, auto-renewing. Browsers show a one-time "not secure" warning because the CA isn't publicly trusted. Acceptable for a Salk-internal-only service.

- [x] Caddyfile uses `{$CADDY_TLS_DIRECTIVE}` → `tls internal` in prod env (PR #145)
- [ ] PR #145 merged

### DNS records

**Current state:** the parent `bloom-dev.salk.edu` A record is **already live in Salk DNS**. All that's missing are the **5 subdomain records** below. No change to the parent needed.

**Domain naming** (matches `scripts/generate-env-config.sh`):

| Env | Main | Studio | MinIO |
|---|---|---|---|
| **Prod** | `bloom-dev.salk.edu` (exists) | `studio.bloom-dev.salk.edu` | `minio.bloom-dev.salk.edu` |
| **Staging** | `staging-bloom-dev.salk.edu` | `staging-studio.bloom-dev.salk.edu` | `staging-minio.bloom-dev.salk.edu` |

**Approach:** First deployment is a soft launch. Until Salk IT adds the 5 missing subdomain records, ops laptops use `/etc/hosts` to map the subdomains → IP.
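Whether resolution comes from `/etc/hosts` (soft launch) or Salk DNS (once the records land), it can be verified the same way. A minimal sketch over the subdomains from the table above — `check_host` is an illustrative helper; note that `getent hosts` follows the OS resolver order (so it sees `/etc/hosts` entries that `dig` bypasses):

```bash
#!/usr/bin/env bash
# Confirm each hostname resolves to the server IP, via /etc/hosts or DNS.
set -uo pipefail

EXPECTED_IP="198.202.68.43"
HOSTS=(
  studio.bloom-dev.salk.edu
  minio.bloom-dev.salk.edu
  staging-bloom-dev.salk.edu
  staging-studio.bloom-dev.salk.edu
  staging-minio.bloom-dev.salk.edu
)

check_host() {
  local resolved
  resolved=$(getent hosts "$1" | awk '{print $1; exit}')
  [ "$resolved" = "$EXPECTED_IP" ]
}

for h in "${HOSTS[@]}"; do
  check_host "$h" && echo "OK   $h" || echo "MISS $h"
done
```

Once Salk IT creates the records, the same loop doubles as the "records created" verification before removing the `/etc/hosts` overrides.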
#### For first deployment (ops laptops only — 5 missing subdomains)

Add to `/etc/hosts` on each ops person's laptop (no entry for `bloom-dev.salk.edu` — that one resolves already):

```
# Prod subdomains (parent bloom-dev.salk.edu already resolves via Salk DNS)
198.202.68.43 studio.bloom-dev.salk.edu
198.202.68.43 minio.bloom-dev.salk.edu
# Staging (all three subdomains)
198.202.68.43 staging-bloom-dev.salk.edu
198.202.68.43 staging-studio.bloom-dev.salk.edu
198.202.68.43 staging-minio.bloom-dev.salk.edu
```

- [x] Ops laptop(s) — 5 /etc/hosts entries added
- [ ] `dig +short studio.bloom-dev.salk.edu` returns `198.202.68.43` on ops laptop

#### For user rollout (Salk IT — async)

Ask Salk IT to add these 5 internal A records, all resolving to `198.202.68.43`:

Prod (2 new subdomains):

- `studio.bloom-dev.salk.edu`
- `minio.bloom-dev.salk.edu`

Staging (3 new subdomains):

- `staging-bloom-dev.salk.edu`
- `staging-studio.bloom-dev.salk.edu`
- `staging-minio.bloom-dev.salk.edu`

All should resolve only on the Salk internal DNS zone (NXDOMAIN from outside Salk). Split-horizon DNS is fine. Parent `bloom-dev.salk.edu` unchanged.

### Follow-up: request DNS records from Salk IT

- [ ] Email sent to Salk IT requesting the 5 records
- [ ] Records created by Salk IT
- [ ] `dig bloom-dev.salk.edu` from a Salk laptop (no /etc/hosts override) → `198.202.68.43`
- [ ] `dig studio.bloom-dev.salk.edu` → `198.202.68.43`
- [ ] `dig bloom-dev.salk.edu` from outside Salk → NXDOMAIN (internal-only confirmed)
- [ ] /etc/hosts entries removed from ops laptops (cleanup once DNS is live)

### Follow-up: get a real TLS cert to replace `tls internal`

Currently Caddy uses a self-signed cert that causes a "not secure" warning in browsers. After deployment, swap to a real cert so users see a green padlock.

#### Option 1 — Let's Encrypt (preferred, fully automatic)

If Salk IT gives us a DNS API key, Caddy uses it to prove we own the domain to Let's Encrypt.
LE then issues the cert. Caddy repeats this every ~60 days on its own.

- [ ] Ask Salk IT for a DNS API key for `salk.edu` (scoped to write TXT records for `bloom-dev.salk.edu`, `studio.bloom-dev.salk.edu`, `minio.bloom-dev.salk.edu`, and staging variants `staging-bloom-dev.salk.edu`, `staging-studio.bloom-dev.salk.edu`, `staging-minio.bloom-dev.salk.edu`)
- [ ] Ask which DNS provider Salk uses (Cloudflare, Route53, etc.) so we use the right Caddy plugin
- [ ] Save the key as `PROD_DNS_API_TOKEN` in GitHub Secrets
- [ ] Rebuild the Caddy image with the DNS plugin baked in
- [ ] Update the `CADDY_TLS_DIRECTIVE` secret from `tls internal` to `tls { dns <provider> {env.DNS_API_TOKEN} }`
- [ ] Redeploy. Check cert issuer is "Let's Encrypt".

#### Option 2 — Salk IT issues a cert file (fallback, manual renewal)

If Salk IT can't give a DNS API key, they can just hand us a cert file signed by Salk's own CA. Same result (trusted cert, green padlock), but we have to ask them again every 1-2 years when it expires.

- [ ] Ask Salk IT for cert + key files covering:
  - prod: `bloom-dev.salk.edu`, `studio.bloom-dev.salk.edu`, `minio.bloom-dev.salk.edu`
  - staging: `staging-bloom-dev.salk.edu`, `staging-studio.bloom-dev.salk.edu`, `staging-minio.bloom-dev.salk.edu`

  Easiest is a SAN cert covering all six. Wildcard `*.bloom-dev.salk.edu` would also work for most (doesn't cover `bloom-dev.salk.edu` itself — that's a separate SAN).
- [ ] Put the files on the server at `/data/bloom/caddy-certs/`, owned by `bloom-deploy`
- [ ] Mount that directory into the Caddy container (`docker-compose.prod.yml`)
- [ ] Update `CADDY_TLS_DIRECTIVE` to `tls /etc/caddy/certs/bloom-dev.salk.edu.crt /etc/caddy/certs/bloom-dev.salk.edu.key`
- [ ] Redeploy. Check cert issuer is Salk's CA.
- [ ] Calendar reminder 2 weeks before the cert expires to ask Salk IT for a renewal.
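The calendar-reminder item can be backed by an automated check: `openssl x509 -checkend` exits non-zero when a cert expires within the given number of seconds. A minimal sketch, assuming the Option 2 file locations; `warn_if_expiring` is an illustrative helper suitable for a cron job:

```bash
#!/usr/bin/env bash
# Warn when a cert file is within N days of expiry (default 14, matching the
# "2 weeks before" reminder above).
set -euo pipefail

warn_if_expiring() {
  local cert="$1" days="${2:-14}"
  if ! openssl x509 -in "$cert" -noout -checkend $(( days * 86400 )) >/dev/null; then
    echo "cert $cert expires within ${days} days — email Salk IT for renewal" >&2
    return 1
  fi
}

# Example: warn_if_expiring /data/bloom/caddy-certs/bloom-dev.salk.edu.crt 14
```

A non-zero exit from cron (with MAILTO set) or from a scheduled Actions job turns the manual reminder into an alert that can't be forgotten.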
### Optional: distribute Caddy root cert to team devices

Eliminates browser warnings without waiting for a real cert. One-time per device.

- [ ] Export root cert after first deploy:

  ```bash
  docker compose exec caddy cat /data/caddy/pki/authorities/local/root.crt > caddy-root.crt
  ```

---

## Phase 4 — Merge deploy-related PRs (in order)

Phase 5+ can't happen until `deploy.yml` is fully wired. Sequence below is coordinated with reviewers. All PRs include 4 small follow-up fixes from the adversarial self-review (`--debug` on Layer C migration summary, password masking in workflow logs, `cp -p` on env snapshot, `timeout-minutes: 30` on deploy jobs).

### Step 1 — CI-only safety nets (independent, no deploy.yml conflicts)

- [ ] Merge [#147](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/147) → `main` — fix CI MinIO write probe (was silently failing as warning)
- [ ] Merge [#148](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/148) → `main` — stronger `verify-env-parity` (catches cross-env leaks + RHS drift + composite-value bugs)

### Step 2 — Deploy hardening chain (order-sensitive, touches deploy.yml rollback block)

Rebase each on the one before it. After each rebase, `git diff` the rollback block and confirm all previously-merged PRs' changes are intact (rebases can silently drop changes — verify by eye).
- [ ] Merge [#139](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/139) → `main` — rollback safety: `git reset --hard`, `exit 1` on missing `.previous_sha`, `.secrets-*.txt` gitignore
- [ ] Merge [#142](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/142) → `main` — docker-native healthchecks + `--wait` + concurrency guard + smoke test
- [ ] Merge [#143](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/143) → `main` — `.env.prod` snapshot + restore on rollback, `umask 077` before heredocs, `cp -p` preserves 0600

### Step 3 — Feature PRs that depend on Step 2 structure

- [ ] Merge [#146](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/146) → `main` — Supabase CLI migration runner (deploy + CI), `--debug` workaround for cli#4839, password masking
- [ ] Merge [#145](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/145) → `main` — Caddy `tls internal` via `CADDY_TLS_DIRECTIVE` env var (CI stays HTTP-only, prod gets local-CA HTTPS)

### Step 4 — Populate GitHub Secrets

AFTER Step 3 and BEFORE Step 5. Until secrets exist, any deploy trigger fails harmlessly at the SSH step — a deliberate safety net during the merge chain.
- [ ] Deploy-access: `DEPLOY_HOST`, `DEPLOY_USER`, `DEPLOY_SSH_KEY`, `DEPLOY_HOST_KEY`
- [ ] ~45 `PROD_*` config values via `scripts/generate-env-config.sh prod`
- [ ] 9 `PROD_*` cryptographic secrets via `scripts/generate-secrets.sh prod`
- [ ] Same two sets with `STAGING_*` prefix
- [ ] `PROD_OPENAI_API_KEY` — real key (not `sk-placeholder`)

### Step 5 — Staging branch + self-hosted runner targeting

- [ ] Create `staging` branch from current `main`:

  ```bash
  git fetch origin
  git checkout -b staging origin/main
  git push -u origin staging
  ```

- [ ] Change [#144](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/144)'s base branch from `main` → `staging` in GitHub UI
- [ ] Verify `bloom-prod-runner` still shows **Idle** in Repo Settings → Actions → Runners
- [ ] Merge [#144](https://github.com/Salk-Harnessing-Plants-Initiative/bloom/pull/144) → `staging` — switches `runs-on:` to `[self-hosted, linux, salk-network]` and adds `timeout-minutes: 30` to both jobs. This triggers the **first staging deploy** automatically.

### Step 6 — Verify staging, rollback rehearsal

See Phase 5 and Phase 6 below.

### Step 7 — First prod deploy

- [ ] Merge `staging` → `main` — triggers **first production deploy**
- [ ] Manual approval in GitHub Actions (required reviewer gate fires via the `production` environment)
- [ ] Monitor: migration summary, container health, smoke test
- [ ] Rollback step present but do NOT trigger manually

### Why this order

- **Steps 1-3 independent of secrets.** PRs land on `main` without real deploys firing — the deploy workflow triggers on every push but fails harmlessly at "Set up SSH" until secrets exist.
- **Step 4 (secrets) gates real deploys.** Once secrets are in, any push to `main` would deploy. So secrets land right before we're ready.
- **Step 5 (staging first).** #144 merges to `staging` (not `main`), triggering a validation deploy against the `staging` environment.
  Only after staging verifies do we merge `staging` → `main` for the real prod deploy.
- **#144 merge last.** It's the one that "arms" the deploy pipeline. Merging earlier would queue 24h-failing deploys (pre-runner) or fail at SSH (pre-secrets).

---

## Phase 5 — Staging deploy (workflow_dispatch)

- [ ] Trigger the staging deploy via Actions UI → `Deploy` workflow → Run workflow → environment: staging
- [ ] Workflow routes to the self-hosted runner (verify "This job runs on self-hosted" in the UI)
- [ ] All 15 containers reach healthy state
- [ ] `supabase db push` applies all 158 historical migrations to the fresh staging DB
- [ ] Migration summary appears at top of the run with every migration showing both `Local` and `Remote` columns populated
- [ ] Storage buckets registered (`SELECT * FROM storage.buckets` returns 7 rows)
- [ ] Login + signup + email verify flow works (or email-auto-confirm if SMTP not configured)
- [ ] LangChain agent responds to a test query
- [ ] File upload to storage succeeds via the UI

---

## Phase 6 — Rollback rehearsal on staging

**Deliberately break a staging deploy and verify auto-rollback works.** Do NOT trust the rollback cold against production.

- [ ] Create a branch with a deliberate break (choose one):
  - Point `bloommcp` at a nonexistent image tag
  - Add an obviously-broken SQL migration (e.g. `INVALID SQL;`)
  - Set `POSTGRES_HOST_PORT` to a colliding port
- [ ] Trigger staging deploy
- [ ] Verify:
  - Deploy fails loudly with a red X + `::error::` annotation at the top of the run summary
  - Diagnostic step runs and shows applied-vs-pending migrations
  - `Rollback on failure` step fires and reverts to the previous good SHA
  - Staging stack returns to the previous known-good state
  - Auto-generated GitHub Actions email notifies the team

---

## Phase 7 — Production deploy

- [ ] All CI checks green on `main`
- [ ] At least 1 reviewer approved the PR being deployed
- [ ] (Recommended) Manual `pg_dump -Fc` snapshot of prod DB before first deploy as cheap insurance:

  ```bash
  ssh bloom-deploy@<host> "
    cd /data/bloom/production
    docker compose exec -T db-prod pg_dump -Fc -U supabase_admin -d postgres > ~/pre-first-deploy-$(date +%Y%m%d).pgdump
    ls -lh ~/pre-first-deploy-*.pgdump
  "
  ```

  Keep until first deploy is confirmed successful. Recovery path: `pg_restore --clean -U supabase_admin -d postgres < ~/pre-first-deploy-*.pgdump`.
- [ ] Push to `main` (or `workflow_dispatch` → production)
- [ ] Approve production deploy in GitHub Actions (required reviewer gate fires)
- [ ] Monitor real-time: migration summary updates; container health check; smoke test
- [ ] Rollback step is present in the workflow but DO NOT trigger it manually

---

## Phase 8 — Post-deploy verification

From a **Salk network host** (laptop on campus / VPN):

- [ ] `https://bloom-dev.salk.edu` loads — accept `tls internal` one-time warning, verify site renders
- [ ] Studio accessible at `https://studio.bloom-dev.salk.edu`
- [ ] Staging accessible at `https://staging-bloom-dev.salk.edu:8443`
- [ ] Sign up → email verify → sign in flow completes end-to-end
- [ ] LangChain agent responds to a test query
- [ ] File upload via UI succeeds → plot appears correctly in `/app/phenotypes` (tests storage+RLS, not just auth)
- [ ] All containers healthy (`docker compose ps`)
- [ ] Database connections working (Supabase login → auth returns a JWT)
- [ ] MinIO writable (upload file via Studio → confirm it lands in the bucket)

From **outside Salk network** (home laptop, non-VPN cellular):

- [ ] `https://bloom-dev.salk.edu` fails to resolve OR connection times out (confirms the Salk-internal access control)
- [ ] `nc -zv <public-host-or-ip> 443` fails

---

## Ongoing — basic operational hygiene (before routine deploys)

### Monitoring / alerting (minimum viable)

- [ ] Slack webhook wired to GitHub Actions "deploy failed" event, or an email distribution that includes the deploy operator. (Full observability is tracked in #111; this is the Monday-minimum.)
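A minimal sketch of that alert, with the posting step left commented out so it can be dry-run. `deploy_failed_payload` is an illustrative helper; the webhook URL is a placeholder, while `GITHUB_REPOSITORY`, `GITHUB_RUN_ID`, and `GITHUB_SHA` are standard GitHub Actions environment variables:

```bash
#!/usr/bin/env bash
# Build the Slack message for a failed deploy run, linking back to the run.
set -euo pipefail

deploy_failed_payload() {
  local repo="$1" run_id="$2" sha="$3"
  printf '{"text":"Deploy FAILED in %s (run %s, sha %s): https://github.com/%s/actions/runs/%s"}' \
    "$repo" "$run_id" "$sha" "$repo" "$run_id"
}

# To actually post (requires a SLACK_WEBHOOK_URL secret):
# curl -fsS -X POST -H 'Content-Type: application/json' \
#   -d "$(deploy_failed_payload "$GITHUB_REPOSITORY" "$GITHUB_RUN_ID" "$GITHUB_SHA")" \
#   "$SLACK_WEBHOOK_URL"
```

Wired into a `if: failure()` step of the deploy workflow, this gives the deploy operator an immediate, linkable alert without waiting for full #111 observability.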
### Incident response contacts

- [ ] Document (in this issue or in `docs/runbooks/incident.md`) who gets called if `bloom-dev.salk.edu` is down Sunday at 2am:
  - Primary: <name + contact>
  - Fallback: <name + contact>
  - Salk IT after-hours number: <number>

### Deploy-timing policy

- [ ] Documented convention: no Friday afternoon deploys, no pre-holiday deploys, no deploys during known busy hours. Outlasts the first deploy and sticks as team convention.

### JWT + key rotation reminder

- [ ] Calendar reminder — rotate JWT keys every 5 years (run `scripts/generate-secrets.sh prod`, update GitHub Secrets, redeploy)
- [ ] Calendar reminder — `DB_ENC_KEY` is effectively non-rotatable without a data-migration plan (encrypted columns would need re-encryption). Flag for #109 automated-backup work.

---

## Manual rollback procedure (if auto-rollback fails)

Keep this documented somewhere accessible (this issue or `docs/runbooks/rollback.md`):

```bash
ssh bloom-deploy@<DEPLOY_HOST>
cd /data/bloom/production
PREV=$(cat .previous_sha)
echo "Rolling back to $PREV"
git reset --hard $PREV
docker compose -f docker-compose.prod.yml --env-file .env.prod up -d --build --remove-orphans
```

If `.previous_sha` is missing or wrong, use `git log --oneline` to find the last known good commit.

**If a migration partially applied and is the cause of failure**, the operator must manually fix the broken migration and redeploy — the Supabase CLI will skip already-applied ones. Partial-apply contract + RLS-specific mitigation is tracked in the rollback-automation follow-up proposal.

---

## Risk disclosures to acknowledge before first deploy

- **#134 (destructive-SQL CI scan)** is still open. The CI lint in the migration PR catches filename + timestamp issues but NOT destructive SQL (`DROP TABLE`, `TRUNCATE`, etc.). **Promote #134 before the SECOND prod deploy.**

---

## Follow-up issues (for after deployment)

- [ ] **Follow-up — audit root disk usage** (not a deploy blocker): `/` at 81% used (75 GB of 98 GB) after Docker migration. Remaining content is NOT Docker — likely old repo clones, package cache, or `/var/log/` accumulation. File as `chore(server): audit and reclaim space on bloom-dev root filesystem`. Run `sudo du -h -d 1 / | sort -rh | head` on the server to identify the biggest offenders.
- Migration rollback automation + safety classifications + audit trail (Proposal 2)
- Local dev: migrate `public._migrations` → `supabase_migrations.schema_migrations` tracking
- Backfill rollback files for 156 historical migrations
- Automated backups for Postgres + MinIO (#109) — promote from follow-up to deploy-blocker
- Destructive-SQL CI scan (#134) — promote from follow-up to deploy-blocker
- Monitoring / resource limits / log rotation (#111)
- GHCR image publishing + SHA-pinned deploys (#107)
- SSH heredoc refactor (#137) — now scoped over 16+ SSH blocks (10 existing + 6 new from the migration PR)
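The `.previous_sha` contract behind the manual rollback procedure can be rehearsed safely in a throwaway repo before trusting it against the server. A minimal sketch, assuming only that `git` is installed locally — nothing here touches the deploy host:

```bash
#!/usr/bin/env bash
# Rehearsal: two commits, .previous_sha recorded before the "deploy",
# then a git reset --hard back to it — same mechanics as the rollback above.
set -euo pipefail

repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=ops@example.test -c user.name=ops \
  commit -q --allow-empty -m "known good"
good_sha=$(git rev-parse HEAD)

echo "$good_sha" > .previous_sha   # what the deploy workflow records pre-deploy
git -c user.email=ops@example.test -c user.name=ops \
  commit -q --allow-empty -m "bad deploy"

git reset -q --hard "$(cat .previous_sha)"
echo "rolled back to $(git rev-parse HEAD)"
```

Running this once makes the failure mode concrete too: delete `.previous_sha` before the reset and the `$(cat ...)` step fails, which is exactly the missing-file case the procedure tells the operator to recover from with `git log --oneline`.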