Suppose your boss just came up and said you need to migrate your production environment, in the middle of business hours. What would you do? This was the exact situation I faced with my own infrastructure: I had little to no lead time, and the only window available was during regular business hours. So, with a little magic and a lot of luck, I migrated three of my production websites (20k visitors/hour on average) with barely a blip of downtime. Let's dive in!

Rules of Engagement

When migrating critical infrastructure, it's important to lay out a few ground rules. Here's what I came up with for all our production infrastructure:

  1. There must be less than 5 minutes of downtime per service across the whole migration
  2. Both old and new production must run in harmony while we slowly filter traffic away from the old host, keeping both instances in sync
  3. Email infrastructure must be moved seamlessly and kept in sync, without IP address renumbering
  4. It needs to happen in the middle of the business day (11 AM - 2 PM)

Preparation

In order to prepare, I allocated a /28 subnet to the new production host, all public-facing IP addresses. This would serve as a temporary home until we could release the old subnet's IP addresses.
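
For reference, a /28 gives you 16 addresses, 14 of them usable hosts once you subtract the network and broadcast addresses, which was plenty for a temporary landing zone. A quick sketch with Python's ipaddress module, using a documentation range rather than our real allocation:

    import ipaddress

    # Hypothetical /28 from the TEST-NET-3 documentation range, not our real allocation.
    net = ipaddress.ip_network("203.0.113.16/28")

    print(net.num_addresses)       # 16 total addresses
    print(len(list(net.hosts())))  # 14 usable host addresses
    print(net.netmask)             # 255.255.255.240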

Next, we started spinning up better-specced virtual machines on the new production host. Our production database (PostgreSQL) went from 8 GB RAM / 120 GB of disk to 16 GB RAM / 250 GB of disk; we had been getting tight on 8 GB of RAM, so we also threw an extra vCPU core into the new machine. Almost all of our production virtual machines doubled in size (RAM/vCPU/disk), a long-awaited upgrade.

We began slowly rsyncing our data over, and set up a real-time script to keep transferring files once the initial rsync was complete. With all files staying in sync over a two-hour observation period, and our PostgreSQL database mirrored, it was time to do the host switch live in production.
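
The real-time script itself was nothing fancy. A minimal sketch of the idea, assuming a hypothetical web root and SSH access to the new host (the real script also handled logging and alerting):

    import subprocess
    import time

    # Hypothetical source path and destination host for illustration only.
    SRC = "/var/www/"
    DEST = "deploy@new-prod.vms-internal.kuby.ca:/var/www/"

    # Re-run rsync on a short interval so the new host trails the old one
    # by at most a minute or so while both are live.
    while True:
        subprocess.run(
            ["rsync", "-a", "--delete", SRC, DEST],
            check=False,  # keep looping even if a single pass fails
        )
        time.sleep(60)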

Always use vanity names, never direct IP addresses

A lesson learned long ago: it's easier to update an A record than it is to renumber entire infrastructure IP subnets (a /26 in active use for production).
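
In practice, that means every application talks to the database by its vanity hostname, never by IP, so a cutover is just a DNS change. A minimal sketch of the application side, with hypothetical credentials pulled from the environment:

    import os

    # The vanity hostname from DNS; the A record behind it is what moves during a migration.
    DB_HOST = "rdb-postgresql-production.vms-internal.kuby.ca"

    # Hypothetical DSN assembly; credentials come from the environment, not the code.
    DSN = (
        f"postgresql://{os.environ['DB_USER']}:{os.environ['DB_PASS']}"
        f"@{DB_HOST}:5432/production"
    )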

Our database server had a vanity hostname of "rdb-postgresql-production.vms-internal.kuby.ca". On our DNS server, we updated that record to include the new PostgreSQL server's IP address, and set a rule in haproxy to drain connections from the old server in favour of the new one (our sync kept the two in step the entire time). Once haproxy had drained all connections to the old production server, we turned off PostgreSQL on the old host and refreshed all our websites: the new host lives!
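
The drain itself can be driven through haproxy's runtime API rather than a config reload. A rough sketch, assuming an admin-level stats socket at /var/run/haproxy.sock and the backend/server names shown here (ours differed):

    import socket

    def haproxy_cmd(cmd: str, sock_path: str = "/var/run/haproxy.sock") -> str:
        """Send one command to haproxy's runtime API over its admin stats socket."""
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.connect(sock_path)
            s.sendall((cmd + "\n").encode())
            return s.recv(65536).decode()

    # Hypothetical backend/server names for illustration.
    # Stop sending new connections to the old PostgreSQL host...
    haproxy_cmd("set server pg_backend/old-pg state drain")

    # ...then watch the existing sessions bleed off before shutting it down.
    print(haproxy_cmd("show servers state pg_backend"))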

Now, let's move the SaaS products and marketing website

This was expected to be a simple task, and it was. Taking the lesson we learned from draining with haproxy, we drained the old backend into the new one and killed off the old host. It worked exactly as expected.
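
Before killing an old host, a quick sanity check that the sites are actually being served from the new backend doesn't hurt. A minimal sketch with a hypothetical URL list:

    from urllib.request import urlopen

    # Hypothetical URLs; substitute your real production endpoints.
    SITES = [
        "https://www.example.com/",
        "https://app.example.com/healthz",
    ]

    for url in SITES:
        with urlopen(url, timeout=5) as resp:
            # Anything other than a 200 here means hold off on decommissioning.
            print(url, resp.status)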

Then we just moved the subnet over to the new host, and presto: our original IPs are still there, plus a new /28 subnet for our development network. Three hours in, and we had moved production. According to our uptime counter, we racked up 13.5 minutes of downtime, not bad for the middle of peak business hours.

Till next time!
