BY Dirk van Dooren and Luuk Rutten

How to create a web scraper/monitoring solution in AWS

This tutorial will teach you how to monitor webpages for changes, send out notifications for these changes, and subscribe to the notifications by using AWS Lambdas. Additionally, you can use our COVID-19 monitoring solution to stay up to date in the Netherlands.

Intro

This blog was written with two purposes in mind: helping people keep up to date with the latest news on COVID-19 put out by the RIVM (the Dutch authority on health), and providing instructions on how to build a simple solution for custom monitoring of changes on a webpage.

In the form below you can leave your phone number and/or e-mail address to be notified of changes made to the RIVM website regarding COVID-19 (corona).

Subscribe here

You can fill in your details in both fields, or just one. We will only use your data to provide you with updates about changes to the RIVM website, as mentioned above.

The technical specifics

Lambda 1, Monitoring

The beating heart of this infrastructure resides in an AWS Lambda triggered by a CloudWatch event.

This lambda does four things: it GETs the COVID-19 page from the RIVM website, compares it with a previous version of the page, stores this version of the page, and finally sends an update to the SNS topic.
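The four steps can be sketched as one small function that is independent of AWS. This is only an illustration, not the authors' handler: `check_for_update` and its four callable parameters are hypothetical names, and treating a missing previous version (the very first run) as "no change" is an assumption. The sketch publishes before it stores, which is the order the code in the rest of this section uses.

```python
from typing import Callable, Optional


def check_for_update(
    fetch: Callable[[], bytes],
    load_previous: Callable[[], Optional[bytes]],
    store: Callable[[bytes], None],
    notify: Callable[[], None],
) -> bool:
    """Run one monitoring cycle; return True if a change was reported."""
    current = fetch()              # GET the page
    previous = load_previous()     # read the version stored by the last run
    # Assumption: on the first run there is nothing to compare with,
    # so no notification is sent.
    changed = previous is not None and current != previous
    if changed:
        notify()                   # publish to the SNS topic
    store(current)                 # keep this version for the next comparison
    return changed
```

Because the steps are passed in as functions, the flow can be tried locally with plain lambdas before wiring it to S3 and SNS. If `notify` raises, `store` is never reached, so the next run would compare against the old version again.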

For this lambda we need four imports: boto3, urllib3, os, and BeautifulSoup (from bs4 import BeautifulSoup).

First we get the page:

http = urllib3.PoolManager()
url = 'https://www.rivm.nl/nieuws/actuele-informatie-over-coronavirus'
resp = http.request('GET', url)  # fetch the current version of the page

Then we get the previous version of the page from the S3 bucket:

# BUCKET_NAME and file_name identify the stored copy of the page
object_s3 = boto3.resource('s3') \
                 .Bucket(BUCKET_NAME) \
                 .Object(file_name)
old_page = object_s3.get().get('Body').read()

Make use of the find function in BeautifulSoup to get the HTML element you want to compare. For documentation purposes we will use a simple one:

def find_latest_post(page):
    # Parse the page and return the element that holds the latest news item
    soup = BeautifulSoup(page, features="html.parser")
    latest = soup.find("div", class_="LatestNews")
    return latest

new_page_news = find_latest_post(resp.data)
old_page_news = find_latest_post(old_page)

Then we simply compare the news item just fetched from the website with the news item we had stored in the S3 bucket, both extracted with the function we just declared.

If we find a difference, we know the website has been updated and we publish a message to the SNS topic, which then notifies everyone who has subscribed.

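topic = '<topic-arn>'  # ARN of the SNS topic
sns_client = boto3.client('sns')
if new_page_news == old_page_news:
    print("nothing new")
else:
    # Assumption: how new_post_title is extracted is not shown in this post;
    # here the text of the element is used as the title.
    new_post_title = new_page_news.get_text(strip=True)
    sns_client.publish(
        TopicArn=topic,
        # Dutch: 'New RIVM COVID19 update available: "<title>". View the post at <url>'
        Message=f'Nieuwe RIVM COVID19 update beschikbaar: "{new_post_title}".\nBekijk het bericht op {url}',
        Subject="RIVM COVID19 Update",
        MessageAttributes={
            'AWS.SNS.SMS.SenderID': {
                'DataType': 'String',
                'StringValue': 'CD19Update'
            }
        }
    )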

Lastly, we write the new page to the S3 bucket for the next comparison.

object_s3.put(Body=resp.data)  # resp.data is the page we just fetched
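A variation on the compare-and-store steps, offered only as a sketch: rather than storing and re-parsing the whole page, one could store a short fingerprint of the monitored fragment and compare fingerprints. This is a swapped-in technique, not what the code above does. The function names `fingerprint` and `has_changed` are hypothetical, and collapsing whitespace before hashing is an assumption meant to avoid notifications for purely cosmetic changes.

```python
import hashlib


def fingerprint(fragment: str) -> str:
    """Return a SHA-256 hex digest of the fragment with whitespace collapsed."""
    # Runs of spaces, tabs and newlines become a single space, so
    # re-indented but otherwise identical HTML yields the same digest.
    normalized = " ".join(fragment.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(new_fragment: str, stored_fingerprint: str) -> bool:
    """Compare the fragment just fetched with the fingerprint from the last run."""
    return fingerprint(new_fragment) != stored_fingerprint
```

With this approach the object written to S3 would be the 64-character digest instead of the page itself, for example the digest of `str(new_page_news)`.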

Now this lambda is done and we can have a look at the second lambda, which is used to subscribe.

Lambda 2, Subscription

On the other side of the solution, the form submits a POST to an API Gateway with Lambda integration. The lambda needs to do three things: check authorization, retrieve the contents of the body, and lastly subscribe these contents to the SNS topic.
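Retrieving the contents of the body is not shown in the snippets in this post, so here is a minimal sketch of that step. It assumes the form posts a JSON object and that the API uses the Lambda proxy integration, where the event carries `body` and `isBase64Encoded`. The field names `email` and `phone` and the function name `parse_contact_details` are hypothetical.

```python
import base64
import json
from typing import Optional, Tuple


def parse_contact_details(event: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (email, phone) from the request body; a missing field becomes None."""
    body = event.get("body") or "{}"
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")
    try:
        data = json.loads(body)
    except ValueError:
        data = {}  # body was not valid JSON
    if not isinstance(data, dict):
        data = {}
    # Assumption: the form fields are called "email" and "phone"
    email = str(data.get("email") or "").strip() or None
    phone = str(data.get("phone") or "").strip() or None
    return email, phone
```

Returning `None` for an empty field matches the form's behaviour of accepting either field on its own.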

Check the provided password like this. Preferably you would retrieve the password from AWS Secrets Manager first, but this will also work:

# Inside the handler; requires `import json` at the top of the module
pwd = event['queryStringParameters']['pwd']
if pwd != password:  # preferably retrieve the password from Secrets Manager
    return {
        'statusCode': 403,
        'body': json.dumps("Forbidden")
    }
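A slightly more defensive version of this check could look like the sketch below. It is an assumption-laden variant, not the authors' code: `check_password` is a hypothetical helper, it tolerates a request without any query string (in which case API Gateway's proxy event has `queryStringParameters` set to null), and it compares with `hmac.compare_digest` so the comparison takes the same time wherever the strings first differ.

```python
import hmac
import json
from typing import Optional


def check_password(event: dict, expected: str) -> Optional[dict]:
    """Return a 403 response if the password is wrong or missing, else None."""
    params = event.get("queryStringParameters") or {}
    supplied = params.get("pwd") or ""
    # Constant-time comparison of the supplied and expected passwords
    if not hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8")):
        return {"statusCode": 403, "body": json.dumps("Forbidden")}
    return None
```

In a handler this would be used as `denied = check_password(event, password)` followed by `if denied: return denied`.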

If a phone number is provided, rewrite it so that it uses your country code. For us that is +31:

if phone[0:2] == "06" and len(phone) == 10:
    phone = "+31{}".format(phone[1:])   # 0612345678 -> +31612345678
if phone[0:4] == "0031" and len(phone) == 13:
    phone = "+{}".format(phone[2:])     # 0031612345678 -> +31612345678
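The same rewrite can be wrapped in a function that also strips common separators and rejects anything that does not end up as a Dutch mobile number. This is a sketch under assumptions: `normalize_dutch_mobile` is a hypothetical name, and accepting spaces, dashes, dots and brackets in the input is an addition that the snippet above does not make.

```python
import re
from typing import Optional

# +31 followed by 6 and eight more digits
E164_NL_MOBILE = re.compile(r"\+316\d{8}")


def normalize_dutch_mobile(raw: str) -> Optional[str]:
    """Return the number as +316XXXXXXXX, or None if it is not recognised."""
    phone = re.sub(r"[\s\-().]", "", raw)  # drop spaces, dashes, dots, brackets
    if phone.startswith("0031"):
        phone = "+" + phone[2:]            # 0031... -> +31...
    elif phone.startswith("06"):
        phone = "+31" + phone[1:]          # 06... -> +316...
    return phone if E164_NL_MOBILE.fullmatch(phone) else None
```

Returning `None` for unrecognised input lets the lambda skip the SMS subscription instead of passing a malformed number to SNS.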

Then simply subscribe to the topic.

client = boto3.client('sns')
topic = '<topic-arn>'  # ARN of the SNS topic
response_email = client.subscribe(
    TopicArn=topic,
    Protocol='email',        # or 'sms'
    Endpoint=email_address,  # or phone
    ReturnSubscriptionArn=False
)

With an email subscription this automatically triggers a confirmation email. For phone subscriptions, however, it does not. Therefore we add code that sends one SMS message to finish this lambda.

client.publish(
    PhoneNumber=phone,
    Message="Bedankt voor het subscriben",  # Dutch: "Thanks for subscribing"
    MessageAttributes={
        'AWS.SNS.SMS.SenderID': {
            'DataType': 'String',
            'StringValue': 'CVD19Update'
        }
    }
)

Conclusion

You have now learned how a custom monitoring solution for changes on a webpage can be implemented fairly easily. This is a basic solution that can easily be expanded upon, e.g. with an additional check whether a phone number or e-mail address has already been subscribed. As of right now there is no unsubscribe function for phone numbers; this is also something that can be added. We could also look at adjusting the subscription confirmation e-mail.

We could also think of custom messages, or filters on specifics in the message (maybe you only want to know about guidelines, maybe only about numbers). You could also extend this solution to incorporate more data sources and add monitoring on those pages automatically.
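Of these extensions, the check for an existing subscription is the smallest. One possible sketch, assuming the SNS client's `list_subscriptions_by_topic` call and its `NextToken` paging, is shown here. `is_already_subscribed` is a hypothetical helper, and the client is passed in as a parameter so the function can be tried without AWS credentials.

```python
def is_already_subscribed(sns_client, topic_arn: str, endpoint: str) -> bool:
    """Return True if `endpoint` already has a subscription on the topic."""
    kwargs = {"TopicArn": topic_arn}
    while True:
        page = sns_client.list_subscriptions_by_topic(**kwargs)
        for subscription in page.get("Subscriptions", []):
            if subscription.get("Endpoint") == endpoint:
                return True
        token = page.get("NextToken")
        if not token:
            return False          # last page reached without a match
        kwargs["NextToken"] = token
```

In the subscription lambda this could be called with `boto3.client('sns')`, the topic ARN, and either the e-mail address or the rewritten phone number before calling `subscribe`. Walking all pages on every request gets slower as the topic grows, so for a large audience a separate lookup table may be the better choice.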

However, for now, this is it. We hope you find this useful and now more than ever: be well!

You can find the source code for these AWS Lambdas in Cloudnation’s Publications Repository