Building a URL Monitor in Python
With Multithreading, Poetry and Unit Testing
Monitoring website availability and response time is crucial for businesses and developers. In this post, we'll create a Python application that reads URLs from a text file, makes HTTP requests to each URL, and logs the HTTP status code, load time, and timestamp in an Excel file. To make our application efficient, we'll use multithreading to check multiple URLs simultaneously. We'll use Poetry for dependency management and unittest for testing.
Project Setup with Poetry
We'll begin by setting up our Python project with Poetry. Poetry is a tool for dependency management and packaging in Python, allowing you to declare libraries your project depends on and it will manage (install/update) them for you. It simplifies package management and deployment of Python applications.
- Install Poetry if you haven't already:
curl -sSL https://install.python-poetry.org | python3 -
- Create a new Python project:
poetry new url_checker
cd url_checker
- Add the necessary dependencies:
poetry add requests pandas openpyxl
The Python Script
Create a new Python script, say url_checker.py, in the project's root directory with the following code:
import datetime
import time
import requests
import pandas as pd
import os
from concurrent.futures import ThreadPoolExecutor
# number of threads to use
NUM_WORKERS = 10
# delay in minutes
DELAY = 5 # adjust this value to your needs
# filename that contains the URLs
URL_FILE = "urls.txt"
# output file name
OUTPUT_FILE = "output.xlsx"
def check_url(url):
    start_time = time.time()
    try:
        # time out after 10 seconds so one hanging server can't stall a worker
        response = requests.get(url, timeout=10)
        end_time = time.time()
        load_time = end_time - start_time
        status_code = response.status_code
    except Exception:
        status_code = "Invalid URL"
        load_time = "N/A"
    timestamp = datetime.datetime.now()
    return url, status_code, load_time, timestamp
def main():
    while True:
        print("Working on URLs...")
        with open(URL_FILE, "r") as file:
            # skip blank lines so we don't request an empty URL
            urls = [line.strip() for line in file if line.strip()]
        with ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
            results = list(executor.map(check_url, urls))
        new_df = pd.DataFrame(results, columns=['URL', 'Status Code', 'Load Time', 'Timestamp'])
        # if the file does not exist, write it with a header
        if not os.path.isfile(OUTPUT_FILE):
            new_df.to_excel(OUTPUT_FILE, index=False)
        else:  # it exists, so append the new rows to the existing data
            df = pd.read_excel(OUTPUT_FILE)
            df = pd.concat([df, new_df], ignore_index=True)
            df.to_excel(OUTPUT_FILE, index=False)
        print(f"Finished checking URLs. Sleeping for {DELAY} minutes.")
        time.sleep(DELAY * 60)

if __name__ == "__main__":
    main()
Our Python script consists of two main functions: check_url and main.
The check_url function is responsible for making an HTTP request to a given URL and measuring the response time. It records the HTTP status code, load time, and the current timestamp. If the URL is invalid or unreachable, it records "Invalid URL" as the status code and "N/A" as the load time.
The main function is the driver: it reads URLs from a text file, uses a thread pool to check the URLs concurrently, and logs the results in an Excel file. It runs in an infinite loop, sleeping for a specified number of minutes between rounds of checks, and prints to the terminal whether it's currently working on URLs or sleeping.
Multithreading with ThreadPoolExecutor
The ThreadPoolExecutor class from the concurrent.futures module allows us to create a pool of worker threads. Each thread checks a URL independently of the others, which means we can check multiple URLs simultaneously. This is a significant speed-up when we have many URLs to check.
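As a minimal illustration of the pattern (with a hypothetical slow_task standing in for check_url):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_task(n):
    # stand-in for an I/O-bound job, like waiting on an HTTP response
    time.sleep(0.1)
    return n * n

# executor.map() distributes the calls across the pool but yields
# results in input order, so each result stays lined up with its input
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(slow_task, [1, 2, 3, 4]))

print(results)  # [1, 4, 9, 16]
```

With four workers the four 0.1-second tasks finish in roughly 0.1 seconds instead of 0.4, which is exactly the speed-up we get on network-bound URL checks.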
Logging Results in Excel
We use the pandas library to log results in an Excel file. Each time the script checks URLs, it appends the results to the Excel file rather than overwriting it. It does this by reading in the existing data, concatenating the new data, and then writing it out again.
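That read-concat-write round trip boils down to a single pd.concat call; a small sketch with made-up rows:

```python
import pandas as pd

# rows already on disk (what pd.read_excel would return)
existing = pd.DataFrame({"URL": ["https://example.com"], "Status Code": [200]})
# rows from the latest round of checks
new = pd.DataFrame({"URL": ["https://example.org"], "Status Code": [404]})

# concat stacks the rows; ignore_index=True renumbers them 0..n-1
combined = pd.concat([existing, new], ignore_index=True)
print(len(combined))  # 2
```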
You can run the script with:
poetry run python url_checker.py
Testing with unittest
Python's built-in unittest
module allows us to write unit tests for our code. For our script, we could write tests to ensure that our check_url
function returns the correct output given a specific input. We could also write tests to ensure that our script correctly reads URLs from a file, logs results in an Excel file, and sleeps for the correct amount of time.
In our use case, writing unit tests would be a bit tricky due to the nature of the operations we are performing (network requests, file operations, and sleeping). However, we can still write some basic tests to verify our check_url
function.
Let's create a new file named test_url_checker.py in the project's root directory:
import unittest
from url_checker import check_url

class TestURLChecker(unittest.TestCase):
    def test_check_url(self):
        # Testing with a valid URL
        url, status_code, load_time, _ = check_url('https://www.google.com')
        self.assertEqual(url, 'https://www.google.com')
        self.assertEqual(status_code, 200)
        self.assertNotEqual(load_time, 'N/A')
        # Testing with an invalid URL
        url, status_code, load_time, _ = check_url('https://www.nonexistentwebsite123456.com')
        self.assertEqual(url, 'https://www.nonexistentwebsite123456.com')
        self.assertEqual(status_code, 'Invalid URL')
        self.assertEqual(load_time, 'N/A')

if __name__ == '__main__':
    unittest.main()
This script tests the check_url function with both a valid URL and an invalid URL. For the valid URL, we expect the status code to be 200 and the load time to not be 'N/A'. For the invalid URL, we expect the status code to be 'Invalid URL' and the load time to be 'N/A'.
You can run the tests with:
poetry run python test_url_checker.py
Remember to replace 'https://www.google.com' and 'https://www.nonexistentwebsite123456.com' with URLs of your choice.
Note: Testing with network requests can be unpredictable, as the status code may vary with network conditions, server availability, etc. For more reliable tests, you might want to consider using a library like responses to mock HTTP requests.
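The same isolation can also be sketched with the standard library's unittest.mock instead of responses, so no extra dependency is needed. The check_url logic is inlined here so the example stands alone:

```python
import datetime
import time
import unittest
from unittest import mock

import requests

def check_url(url):
    # same logic as in url_checker.py, inlined so this sketch is self-contained
    start_time = time.time()
    try:
        response = requests.get(url, timeout=10)
        load_time = time.time() - start_time
        status_code = response.status_code
    except Exception:
        status_code = "Invalid URL"
        load_time = "N/A"
    return url, status_code, load_time, datetime.datetime.now()

class TestCheckURLMocked(unittest.TestCase):
    @mock.patch("requests.get")
    def test_mocked_200(self, mock_get):
        # fake a 200 response; no real network traffic is made
        mock_get.return_value.status_code = 200
        _, status_code, load_time, _ = check_url("https://example.com")
        self.assertEqual(status_code, 200)
        self.assertNotEqual(load_time, "N/A")
```

Because requests.get is patched, the test passes regardless of network conditions and runs instantly.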
In conclusion, the combination of Python with multithreading, Poetry, and unittest provides a powerful toolkit for building a URL monitor. This application is a robust, efficient solution for checking the availability and response time of a list of websites and logging the results for analysis. Happy coding!