Around 2015 I worked at an IT company that focused on website hosting for semi-complex environments. At the time, the company wasn't well known in the IT market, but it showed great promise. To increase its value and revenue, the company had just landed its biggest migration project yet.
We, the engineers, had to migrate about 15 different applications from on-premise to a private cloud solution, which included automating the new setup with Ansible: over 100 VMs, set up redundantly across multiple data centers for fault tolerance.
Performing a migration is technically a different exercise in 2021, given the cloud solutions available today. In this blog post, I describe three lessons I learned during that migration and what could have been improved using AWS.
1. Know what you sell
The SLA was written by our colleagues in sales with the help of a single engineer. It included technical designs and the tools to use.
Unfortunately, there were a lot of moving targets while the engineering team was preparing the new environment, and we had to deviate from many of the technical choices agreed upon in the SLA. As you can imagine, this isn't beneficial for a project like this: it costs time, and thus money. I won't dive into too much detail, but one agreement was to use a fairly new distributed filesystem named Gluster, a Red Hat product.
In our use case, Gluster had to distribute about a million small files to each VM through the filesystem. During the preparation phase, Gluster produced one problem after another: split-brain errors, application failures, and degraded network throughput.
Eventually, we hired a consultant to look at our use case, hoping he could provide a solution. Since Gluster is designed for larger files, the consultant couldn't help us with our small-file workload. After weeks and weeks of issues, we switched to the old and trusted NFS, and the distributed filesystem problems simply disappeared.
In this day and age, this use case would be a perfect fit for Amazon Elastic File System (EFS). Amazon EFS provides a simple, serverless, set-and-forget, elastic file system that lets you share file data without provisioning or managing storage. This eliminates the need to provision and manage capacity to accommodate growth.
Especially if you need to re-platform your applications and don't have the time to refactor, a simple managed service like Amazon EFS can bring a huge benefit to how you run your platform.
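Because EFS speaks standard NFSv4, mounting it from every web VM is one Ansible task. A minimal sketch, assuming a hypothetical filesystem ID (`fs-12345678`), region, and mount path:

```yaml
# Hypothetical Ansible task: mount a shared EFS filesystem on each web VM.
# The filesystem ID, region, and path are placeholders for illustration.
- name: Mount shared EFS filesystem over NFSv4.1
  ansible.posix.mount:
    path: /var/www/shared
    src: fs-12345678.efs.eu-west-1.amazonaws.com:/
    fstype: nfs4
    opts: nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2
    state: mounted
```

The mount options shown are the ones AWS recommends for EFS when not using the `amazon-efs-utils` helper; with the helper installed, `fstype: efs` and the bare filesystem ID work as well.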
2. Cold migration means slow migration
The idea was to automate the new environment with Ansible: the configuration of each VM had to be written in Ansible, in an idempotent fashion. One task I worked on was writing the configuration for a MongoDB replica set, and eventually a migration playbook as well. The customer created a dump of their MongoDB database in their on-premise environment and placed it on a server where we could pick it up.
We then uploaded the dump to the MongoDB server, unpacked it, and imported it. Once imported, MongoDB itself replicated the database to the remaining VMs. This process took about five hours for a 50 GB database.
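The fetch-unpack-import flow above can be sketched as an Ansible playbook. This is a simplified illustration, not the original playbook; the host group, URL, paths, and `mongorestore` flags are all placeholders:

```yaml
# Hypothetical sketch of the migration playbook: fetch the customer's dump,
# restore it into the primary, and let the replica set sync the secondaries.
- hosts: mongodb_primary
  tasks:
    - name: Fetch the database dump from the hand-over server
      ansible.builtin.get_url:
        url: https://handover.example.com/dumps/customer-db.tar.gz
        dest: /tmp/customer-db.tar.gz

    - name: Unpack the dump
      ansible.builtin.unarchive:
        src: /tmp/customer-db.tar.gz
        dest: /tmp/dump
        remote_src: true

    - name: Import the dump into the primary node
      ansible.builtin.command:
        cmd: mongorestore --drop /tmp/dump
```

Note that the `mongorestore` step is not idempotent on its own; in practice you would guard it with a check or a `creates` marker so re-running the playbook doesn't re-import five hours of data.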
Nowadays, five hours is far too long for such a process. The AWS Database Migration Service is free to use for six months per instance if you're migrating to Amazon Aurora, Amazon Redshift, Amazon DynamoDB or Amazon DocumentDB (with MongoDB compatibility).
A huge upside of this service is that the source database remains fully operational during the migration, minimizing downtime to applications that rely on the database.
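In infrastructure-as-code terms, a DMS migration boils down to a replication task between two endpoints. A hypothetical CloudFormation fragment, with all ARN references as placeholders; the `full-load-and-cdc` migration type does the initial copy plus ongoing change data capture, which is what keeps the source operational and shrinks the cut-over window:

```yaml
# Hypothetical CloudFormation fragment for a DMS replication task.
# Endpoint and replication instance resources are assumed to be defined
# elsewhere in the template.
MigrationTask:
  Type: AWS::DMS::ReplicationTask
  Properties:
    MigrationType: full-load-and-cdc   # initial copy + live change capture
    ReplicationInstanceArn: !Ref ReplicationInstance
    SourceEndpointArn: !Ref OnPremSourceEndpoint
    TargetEndpointArn: !Ref DocumentDbTargetEndpoint
    TableMappings: >-
      {"rules": [{"rule-type": "selection", "rule-id": "1",
                  "rule-name": "include-all",
                  "object-locator": {"schema-name": "%", "table-name": "%"},
                  "rule-action": "include"}]}
```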
3. The migration
Once all preparations were complete, we planned the migration for a weekend, starting on a Friday at 8pm. For over half a year we had been developing, configuring and working towards that exact moment. And then it arrives: Friday, 8pm.
The developers, engineers and management are all present and ready to migrate. The first script takes off and we are in the middle of the migration. We expected at least some issues after a couple of hours, but none appeared: no alerts popped up in our monitoring system, and everything went smoothly. It felt like a little miracle was happening. At least, that's what we thought. Once all systems were a go, we flipped the DNS records and visitors landed on the new environment. That evening we all went to bed with a satisfied feeling.
That Monday at 09:30, alerts started popping up in our monitoring. The site randomly returned 503 errors, what the heck!? While hunting for the root cause, we saw the site had about 300 concurrent visitors; something in our platform wasn't designed to handle that much load. We had run performance tests, but as it turned out, they hadn't anticipated the traffic we got that morning. It took us the whole day to scale up web servers manually before we could handle the load. This taught us to make the system scalable for unexpected events.
By moving the workload into AWS, you gain the ability to automatically scale to meet demand. Using auto scaling groups, we can increase or decrease the number of instances in response to changing conditions. A condition could be "add two more instances once CPU usage stays above 80% for ten minutes or longer" or "remove two instances when average network traffic drops below a threshold". By attaching the auto scaling group as a target of an Application Load Balancer, incoming requests are spread effectively across the instances, keeping the application running without downtime.
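The "add two instances at 80% CPU for ten minutes" rule above maps directly onto a scaling policy plus a CloudWatch alarm. A hedged CloudFormation sketch, assuming an auto scaling group named `WebServerGroup` is defined elsewhere in the template:

```yaml
# Hypothetical scale-out rule: +2 instances when average CPU across the
# group stays above 80% for two consecutive 5-minute periods (10 minutes).
ScaleOutPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref WebServerGroup
    AdjustmentType: ChangeInCapacity
    ScalingAdjustment: 2
    Cooldown: 300

HighCpuAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/EC2
    MetricName: CPUUtilization
    Dimensions:
      - Name: AutoScalingGroupName
        Value: !Ref WebServerGroup
    Statistic: Average
    Period: 300          # 5-minute periods...
    EvaluationPeriods: 2 # ...two in a row = 10 minutes sustained
    Threshold: 80
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref ScaleOutPolicy
```

A matching scale-in policy with a negative `ScalingAdjustment` would handle the "decrease" direction; newer setups often replace both with a single target tracking policy.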
These three lessons have been, and will continue to be, very helpful in my career as an AWS cloud consultant. I'd happily let you learn from my mistakes so you don't have to repeat them. Link up with me and get in touch if you have any questions about migrating from on-premise to the cloud!