Case Study — Live platform migration

Situation

DNS, transactional email, marketing site, and TLS issuance were spread across two vendors (Scaleway and Lovable) and an in-cluster cert-manager setup. The fragmentation was creating ops drag — the marketing team couldn't iterate on the site, cluster cert renewals had failure modes the engineering team wanted out of, and basic changes required coordination across three different systems. Leadership wanted a single edge platform.

Mandate

Own end-to-end. Plan the cutover, coordinate dependencies across infrastructure, marketing, and engineering, and execute a live switchover without disrupting production services or the marketing site.

Approach

Audited the existing Scaleway setup — including transactional-email records confirmed unused and safe to drop — and produced the migration plan as a chain of nine dependent engineering issues so the team could see what blocked what.
Hardened the plan after external review: pulled the ClusterIssuer rename into a separate PR to avoid Let's Encrypt rate-limit risk, made proxied=false a hard invariant on critical records, added DS/CAA preflight gates and in-cluster verification, split cleanup into phased PRs to keep rollback windows narrow.
Lowered DNS TTLs ~24h ahead of value swaps to keep propagation windows short.

What I built and ran

PR-driven cutover sequence: baseline DNS PR, value-swap PRs, cleanup PRs. Live NS flip at 10:55 CDT, full propagation by 11:33, value-swap PRs merged 11:55/12:10, first cert-manager force-renewal succeeded at 12:11 — same morning.
Marketing site re-platformed from a no-code tool (Lovable) to Cloudflare Workers, with GitHub Actions CI/CD, per-PR preview URLs for every branch, framework upgrades, and a fresh-start review of the inherited codebase. Turned a manual, single-person publish workflow into one any engineer could safely run.
Hardening PR: consent-script extraction, OG metadata hardening, SSG head-hoist, Analytics Engine binding, Lighthouse demoted from per-PR to a weekly cron to cut CI noise.
Production incident response: a CSP script-src directive without 'unsafe-inline' blocked the SSG-injected runtime — site broken. Rolled back to last good SHA, then deployed a fix restoring 'unsafe-inline' plus a build-time static check that gates future builds against the Worker's CSP, so this can't recur silently.
Cluster-side: validated cert-manager ClusterIssuers across clusters now using Cloudflare DNS-01. Approved supporting platform PRs (Teleport role upgrade for wildcard CRD visibility; topolvm affinity fix) that unblocked fleet-wide reconciliation.
Operational-knowledge handoff: catalog of edges the team would otherwise have re-hit — DNS provider apex-NS TTL platform-locks, record-ID rotation on out-of-band edits, cert-manager backoff on prior-failed Orders, Helm silent-failure modes, the necessity of pre-lowering TTLs.

Outcomes

Migration completed live with no observed production impact.
Certificate issuance fully migrated; first force-renewal succeeded immediately post-cutover.
Marketing site re-platformed onto a real engineering workflow — preview URLs per PR, rollback in one click, CI gates against the exact failure mode that took the site down once.
Cross-team execution across infra, marketing, and engineering completed cleanly. Operational-knowledge catalog handed off so the team won't be a single-person-knows-the-edge-cases organization.

Why it matters