Situation
DNS, transactional email, marketing site, and TLS issuance were spread across two vendors (Scaleway and Lovable) and an in-cluster cert-manager setup. The fragmentation was creating ops drag — the marketing team couldn't iterate on the site, cluster cert renewals had failure modes the engineering team wanted out of, and basic changes required coordination across three different systems. Leadership wanted a single edge platform.
Mandate
Own end-to-end. Plan the cutover, coordinate dependencies across infrastructure, marketing, and engineering, and execute a live switchover without disrupting production services or the marketing site.
Approach
- Audited the existing Scaleway setup — including transactional-email records confirmed unused and safe to drop — and produced the migration plan as a chain of nine dependent engineering issues so the team could see what blocked what.
- Hardened the plan after external review: pulled the ClusterIssuer rename into a separate PR to avoid Let's Encrypt rate-limit risk, made proxied=false a hard invariant on critical records, added DS/CAA preflight gates and in-cluster verification, split cleanup into phased PRs to keep rollback windows narrow.
- Lowered DNS TTLs ~24h ahead of value swaps to keep propagation windows short.
What I built and ran
- PR-driven cutover sequence: baseline DNS PR, value-swap PRs, cleanup PRs. Live NS flip at 10:55 CDT, full propagation by 11:33, value-swap PRs merged 11:55/12:10, first cert-manager force-renewal succeeded at 12:11 — same morning.
- Marketing site re-platformed from a no-code tool (Lovable) to Cloudflare Workers, with GitHub Actions CI/CD, per-PR preview URLs for every branch, framework upgrades, and a fresh-start review of the inherited codebase. Turned a manual, single-person publish workflow into one any engineer could safely run.
- Hardening PR: consent-script extraction, OG metadata hardening, SSG head-hoist, Analytics Engine binding, Lighthouse demoted from per-PR to a weekly cron to cut CI noise.
- Production incident response: a CSP script-src directive without 'unsafe-inline' blocked the SSG-injected runtime — site broken. Rolled back to last good SHA, then deployed a fix restoring 'unsafe-inline' plus a build-time static check that gates future builds against the Worker's CSP, so this can't recur silently.
- Cluster-side: validated cert-manager ClusterIssuers across clusters now using Cloudflare DNS-01. Approved supporting platform PRs (Teleport role upgrade for wildcard CRD visibility; topolvm affinity fix) that unblocked fleet-wide reconciliation.
- Operational-knowledge handoff: catalog of edges the team would otherwise have re-hit — DNS provider apex-NS TTL platform-locks, record-ID rotation on out-of-band edits, cert-manager backoff on prior-failed Orders, Helm silent-failure modes, the necessity of pre-lowering TTLs.
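The build-time CSP gate from the incident bullet above, written out as an added example that is not the engagement's own tooling: it parses a CSP header value, scans built HTML for inline scripts, and reports a problem when script-src would block them. The CSP strings and HTML are constructed samples; in CI it would run on the Worker's real header and the build output.

```javascript
// Added example, not the page's own tooling: a build-time gate that fails
// when the Worker's CSP would block the site's own inline scripts.
// The CSP strings and HTML below are constructed samples.

// Parse a CSP header value into { directiveName: [sources...] }.
function parseCsp(header) {
  const directives = {};
  for (const part of header.split(';')) {
    const [name, ...sources] = part.trim().split(/\s+/).filter(Boolean);
    if (name) directives[name.toLowerCase()] = sources;
  }
  return directives;
}

// Count <script> tags that are inline (no src) and carry no nonce attribute.
function countUnnoncedInlineScripts(html) {
  let count = 0;
  for (const m of html.matchAll(/<script\b([^>]*)>/gi)) {
    const attrs = m[1];
    if (!/\bsrc\s*=/i.test(attrs) && !/\bnonce\s*=/i.test(attrs)) count++;
  }
  return count;
}

// Returns a list of problems; an empty list means the build may ship.
function checkCsp(header, html) {
  const csp = parseCsp(header);
  // script-src falls back to default-src when absent.
  const scriptSrc = csp['script-src'] || csp['default-src'];
  if (!scriptSrc) return []; // no script restriction at all
  const inline = countUnnoncedInlineScripts(html);
  if (inline === 0) return [];
  // Per CSP3, a nonce or hash source makes 'unsafe-inline' ignored, so inline
  // scripts then need their own nonce (or a matching hash, not checked here).
  const strict = scriptSrc.some((s) => /^'(nonce-|sha(256|384|512)-)/.test(s));
  const allowsInline = !strict && scriptSrc.includes("'unsafe-inline'");
  if (allowsInline) return [];
  return [`${inline} inline <script> block(s) blocked by script-src ${scriptSrc.join(' ')}`];
}

// Constructed sample: an SSG-hoisted runtime injected as an inline script.
const SAMPLE_HTML =
  '<head><script>window.__runtime={};</script></head>' +
  '<body><script src="/_app/start.js"></script></body>';
const BROKEN_CSP = "default-src 'self'; script-src 'self'";
const FIXED_CSP = "default-src 'self'; script-src 'self' 'unsafe-inline'";

// In CI, a non-empty result is where the build step exits non-zero.
const problems = checkCsp(BROKEN_CSP, SAMPLE_HTML);
console.log(problems.length ? 'CSP gate FAILED:\n' + problems.join('\n') : 'CSP gate passed');
```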
Outcomes
- Migration completed live with no observed production impact.
- Certificate issuance fully migrated; first force-renewal succeeded immediately post-cutover.
- Marketing site re-platformed onto a real engineering workflow — preview URLs per PR, rollback in one click, CI gates against the exact failure mode that took the site down once.
- Cross-team execution across infra, marketing, and engineering completed cleanly. Operational-knowledge catalog handed off so the team won't be a single-person-knows-the-edge-cases organization.
Why it matters
Cross-domain migrations like this usually get split across three vendors or stall on the engineering backlog for months. I run them end to end — planning, the infrastructure changes, engineering reviews, production incident response, and the knowledge transfer afterward. If you have a vendor consolidation or platform migration waiting on engineering bandwidth, this is the kind of engagement I take on.