In a car you need measuring instruments and notifications. On one hand you want real-time information on things like your current speed, the speed limit and traffic. On the other hand you want to be informed when you need to take action. I don’t want unnecessary speeding fines on the autobahn when I'm headed for the Alps. I'd rather hear a beep when there is a “speed audit”. And I certainly don’t want to get stuck in traffic when I could have spent that time on ski slopes with pristine white snow... Excuse my vacation aspirations, COVID-19 has got me wishing for things. But yeah, I'd rather be notified when there is a better route. It's pretty clear that the instruments you use can have a serious impact on the quality of your journey.
Cloud Native Monitoring & Alerting
In the context of business, this quality can directly impact revenue, determine the happiness of your customers, or worse, make you lose customers. You don’t want that, so you measure things properly and set up notifications so that action is taken when it is needed. You want to measure and get notified on your cloud resources.
What instruments you CAN use depends on what the supplier provides. Your options are limited to what the infotainment system provides and the ways you can connect to it. Maybe alerts for those “speed audits”, or recommendations to avoid that traffic congestion, are already built in. Or maybe you need a third party: install and configure apps on your phone and connect every time you use the car. In the public cloud, major suppliers like Microsoft and Amazon also provide native tooling for monitoring and getting notified when action is needed. When the tooling ‘that came with it’ is not enough, your third-party options are limited to what connects nicely to your chosen cloud platform.
What instruments you WILL use depends on your use case. Do you need to monitor infrastructure that resides in the cloud as well as on-premises? Do you need to monitor only one public cloud or does your company have a multi-cloud strategy? Different use cases ask for different solutions.
A serverless architecture to manage servers
Okay, enough blah blah, let’s get our hands dirty. I will show you what a cloud native monitoring and alerting solution for servers on a single cloud (AWS) looks like and use “built-in” serverless cloud services. To keep it simple and help you understand the idea, we’re only looking at EC2 metrics. The goal here is to minimize time and resources spent on configuration and maintenance, minimize service costs, focus on the ability to use it on a large-scale environment and to get started quickly for new environments that need to be managed. So, to minimize tinkering and to maximize value.
The architecture consists of three parts:
- Creating a place to receive and pass on notifications;
- Figuring out what needs to be measured;
- Creating alerts to send actionable notifications.
You can find the code here.
Plain and simple EC2 instances to explain the solution
Before we start, there are some requirements. The EC2 servers are properly managed, with the Systems Manager (SSM) and CloudWatch agents installed and sending metrics to CloudWatch. For this example, the CloudWatch agent needs to publish memory and disk space utilization, because EC2 does not report those metrics by default.
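To give an idea of what that requirement means in practice, here is a minimal sketch of a CloudWatch agent configuration that publishes memory and disk usage. It is not taken from the repository: the function name `build_agent_config` and the 60-second interval are my own assumptions, and your agents may well be configured differently (for example through SSM Parameter Store).

```python
import json


def build_agent_config(interval_seconds=60):
    """Return a CloudWatch agent config that publishes memory and disk usage.

    Hypothetical helper for illustration. The agent publishes these as
    mem_used_percent and disk_used_percent, by default in the CWAgent namespace.
    """
    return {
        "metrics": {
            # Tag every metric with the instance ID so alarms can target it.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "mem": {
                    "measurement": ["mem_used_percent"],
                    "metrics_collection_interval": interval_seconds,
                },
                "disk": {
                    "measurement": ["used_percent"],
                    "resources": ["*"],  # all mounted file systems
                    "metrics_collection_interval": interval_seconds,
                },
            },
        }
    }


if __name__ == "__main__":
    # Print the JSON you would hand to the agent.
    print(json.dumps(build_agent_config(), indent=2))
```

Note that disk metrics come with extra dimensions (such as path and device), so alarms on disk usage need those dimensions as well.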
For the first part, we prepare AWS SNS to receive notifications about the EC2 instances and pass them on via email, Slack or something else of your liking. In this demo you see my email address in the screenshot below. We deploy SNS in a separate CloudFormation stack, so you keep flexibility in how you communicate with the engineering team.
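As a rough sketch of what such an SNS stack could contain, the snippet below builds a small CloudFormation template as a Python dictionary: one topic with an email subscription, plus an output with the topic ARN. The names `build_sns_template`, `AlertTopic` and `AlertTopicArn` are made up for this illustration and are not necessarily what the repository uses.

```python
import json


def build_sns_template(email):
    """Build a CloudFormation template with one SNS topic and an email subscriber.

    Hypothetical helper; the resource and output names are illustrative only.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "SNS topic that receives alarm notifications",
        "Resources": {
            "AlertTopic": {
                "Type": "AWS::SNS::Topic",
                "Properties": {
                    # The recipient has to confirm the subscription by email.
                    "Subscription": [{"Protocol": "email", "Endpoint": email}],
                },
            }
        },
        "Outputs": {
            # Ref on an SNS topic returns its ARN, which the alarms need later.
            "AlertTopicArn": {"Value": {"Ref": "AlertTopic"}},
        },
    }


if __name__ == "__main__":
    # example.com address used as a placeholder
    print(json.dumps(build_sns_template("oncall@example.com"), indent=2))
```

Keeping this in its own stack means you can add a Slack or other subscription later without touching the alarms.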
In the second part, we need to know what needs to be measured. No need to pester every application owner to guess which servers are used: in the cloud we can simply perform a scheduled scan to look for running resources. In this example we ask our cloud environment which EC2 instances are running, using Advanced Queries within AWS Config. Queries can get expensive if you run them every minute, so we settle for a daily scan. Based on the running EC2 instances found, we kick off a CloudFormation stack. In the picture below we find two instances. This second part is automated with Python (boto3) in a Lambda function deployed with AWS SAM. So we can set it and forget it.
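The scan itself could look roughly like the sketch below. It uses the boto3 Config call `select_resource_config` to run an advanced query and follows the pagination token until all results are in. The function names and the exact query are my assumptions for illustration; the Lambda in the repository may be structured differently.

```python
import json

# Advanced query (SQL-like) for all EC2 instances that AWS Config sees as running.
RUNNING_INSTANCES_QUERY = (
    "SELECT resourceId "
    "WHERE resourceType = 'AWS::EC2::Instance' "
    "AND configuration.state.name = 'running'"
)


def find_running_instance_ids(config_client):
    """Return the sorted IDs of running EC2 instances known to AWS Config.

    config_client is a boto3 'config' client (or anything with the same
    select_resource_config method, which keeps this easy to test).
    """
    instance_ids = []
    kwargs = {"Expression": RUNNING_INSTANCES_QUERY}
    while True:
        response = config_client.select_resource_config(**kwargs)
        # Each result is a JSON string holding the selected fields.
        for raw in response.get("Results", []):
            instance_ids.append(json.loads(raw)["resourceId"])
        token = response.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return sorted(instance_ids)


def lambda_handler(event, context):
    """Hypothetical entry point, triggered by a daily schedule."""
    import boto3  # available in the AWS Lambda Python runtime

    instance_ids = find_running_instance_ids(boto3.client("config"))
    return {"running_instances": instance_ids}
```

Passing the client in as an argument is a deliberate choice: you can try the logic locally with a stub before deploying anything.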
Finally, in the third part a CloudFormation stack is created to manage the Alarms that send actionable notifications. Through this stack, CloudWatch Alarms are created and gracefully deleted according to the scheduled findings of Config, so you don’t end up with an expensive mess when instances are terminated. Also, by using Config you can easily extend the monitoring solution to keep an eye on other AWS services like RDS, VPN, etc.
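To make the idea concrete, here is a minimal sketch that turns a list of instance IDs into a CloudFormation template with one memory alarm per instance. Because the template is regenerated from the latest scan, an instance that disappears from the list also loses its alarm on the next stack update. The function name, the 90% threshold and the evaluation settings are assumptions of mine, and the metric name assumes the CloudWatch agent publishes `mem_used_percent` in the `CWAgent` namespace with an `InstanceId` dimension.

```python
import re


def build_alarm_template(instance_ids, topic_arn, threshold=90.0):
    """Build a CloudFormation template with one memory alarm per instance.

    Hypothetical helper; thresholds and periods are example values only.
    """
    resources = {}
    for instance_id in instance_ids:
        # Logical IDs must be alphanumeric, so strip the dash from "i-...".
        logical_id = "MemAlarm" + re.sub(r"[^A-Za-z0-9]", "", instance_id)
        resources[logical_id] = {
            "Type": "AWS::CloudWatch::Alarm",
            "Properties": {
                "AlarmDescription": f"Memory above {threshold}% on {instance_id}",
                "Namespace": "CWAgent",
                "MetricName": "mem_used_percent",
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                "Statistic": "Average",
                "Period": 300,  # seconds per data point
                "EvaluationPeriods": 2,  # two breaching periods in a row
                "Threshold": threshold,
                "ComparisonOperator": "GreaterThanThreshold",
                # Notify the SNS topic from the first part.
                "AlarmActions": [topic_arn],
            },
        }
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}
```

One thing to keep in mind: CloudFormation does not accept a template without resources, so when the scan finds no running instances you would delete the stack rather than update it with an empty template.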
In this solution we only need (1) a CloudFormation stack to launch for setting up SNS, and (2) a Lambda function to deploy, which searches for resources and sets up what to measure. When you extend it to other metrics, you can manage a large-scale environment cost-effectively and quickly set up monitoring & alerting using infrastructure as code.
So now the question for you is: what is your experience with staying informed about your cloud resources, and how is it currently going? Are you reaching the top of your snowy mountain while being in control the whole way through? If not entirely, maybe we at CloudNation can help. It’s always good to talk to someone who has been there before. Feel free to send me a message.