Building a URL Monitor in Python
With Multithreading, Poetry and Unit Testing
Monitoring website availability and response time is crucial for businesses and developers. In this post, we'll create a Python application that reads URLs from a text file, makes HTTP requests to each URL, and logs the HTTP status code, load time, and timestamp in an Excel file. To make our application efficient, we'll use multithreading to check multiple URLs simultaneously. We'll use Poetry for dependency management and unittest for testing.
Project Setup with Poetry
We'll begin by setting up our Python project with Poetry. Poetry is a tool for dependency management and packaging in Python, allowing you to declare libraries your project depends on and it will manage (install/update) them for you. It simplifies package management and deployment of Python applications.
- Install Poetry if you haven't already:
curl -sSL https://install.python-poetry.org | python3 -
- Create a new Python project:
poetry new url_checker
cd url_checker
- Add the necessary dependencies:
poetry add requests pandas openpyxl
The Python Script
Create a new Python script, say url_checker.py, in the project's root directory with the following code:
import datetime
import time
import requests
import pandas as pd
import os
from concurrent.futures import ThreadPoolExecutor
# number of threads to use
NUM_WORKERS = 10
# delay in minutes
DELAY = 5 # adjust this value to your needs
# filename that contains the URLs
URL_FILE = "urls.txt"
# output file name
OUTPUT_FILE = "output.xlsx"
def check_url(url):
    start_time = time.time()
    try:
        # time out after 10 seconds so one hanging server can't stall a worker
        response = requests.get(url, timeout=10)
        end_time = time.time()
        load_time = end_time - start_time
        status_code = response.status_code
    except Exception:
        status_code = "Invalid URL"
        load_time = "N/A"
    timestamp = datetime.datetime.now()
    return url, status_code, load_time, timestamp
def main():
    while True:
        print("Working on URLs...")
        with open(URL_FILE, "r") as file:
            # skip blank lines so we don't request an empty URL
            urls = [line.strip() for line in file if line.strip()]
        with ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
            results = list(executor.map(check_url, urls))
        new_df = pd.DataFrame(results, columns=['URL', 'Status Code', 'Load Time', 'Timestamp'])
        # if the file does not exist, write it with a header
        if not os.path.isfile(OUTPUT_FILE):
            new_df.to_excel(OUTPUT_FILE, index=False)
        else:  # it exists, so append the new rows to the existing data
            df = pd.read_excel(OUTPUT_FILE)
            df = pd.concat([df, new_df], ignore_index=True)
            df.to_excel(OUTPUT_FILE, index=False)
        print(f"Finished checking URLs. Sleeping for {DELAY} minutes.")
        time.sleep(DELAY * 60)

if __name__ == "__main__":
    main()
Our Python script consists of two main functions: check_url and main.
The check_url function is responsible for making an HTTP request to a given URL and measuring the response time. It records the HTTP status code, load time, and the current timestamp. If the URL is invalid or unreachable, it records "Invalid URL" as the status code and "N/A" as the load time.
The main function is the driver: it reads URLs from a text file, uses a thread pool to check the URLs concurrently, and logs the results in an Excel file. It runs in an infinite loop, sleeping for a specified number of minutes between rounds of checks, and prints to the terminal whether it's currently working on URLs or sleeping.
Multithreading with ThreadPoolExecutor
The ThreadPoolExecutor class from the concurrent.futures module allows us to create a pool of worker threads. Each thread checks a URL independently of the others, which means we can check multiple URLs simultaneously. This is a significant speed-up when we have many URLs to check.
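As a minimal illustration of the pattern (with a hypothetical slow_task standing in for check_url):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_task(n):
    # stand-in for an I/O-bound job, like waiting on an HTTP response
    time.sleep(0.1)
    return n * n

# executor.map() distributes the calls across the pool but yields
# results in input order, so each result stays lined up with its input
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(slow_task, [1, 2, 3, 4]))

print(results)  # [1, 4, 9, 16]
```

With four workers the four 0.1-second tasks finish in roughly 0.1 seconds instead of 0.4, which is exactly the speed-up we get on network-bound URL checks.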
Logging Results in Excel
We use the pandas library to log results in an Excel file. Each time the script checks URLs, it appends the results to the Excel file rather than overwriting it. It does this by reading in the existing data, concatenating the new data, and then writing it out again.
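That read-concat-write round trip boils down to a single pd.concat call; a small sketch with made-up rows:

```python
import pandas as pd

# rows already on disk (what pd.read_excel would return)
existing = pd.DataFrame({"URL": ["https://example.com"], "Status Code": [200]})
# rows from the latest round of checks
new = pd.DataFrame({"URL": ["https://example.org"], "Status Code": [404]})

# concat stacks the rows; ignore_index=True renumbers them 0..n-1
combined = pd.concat([existing, new], ignore_index=True)
print(len(combined))  # 2
```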
You can run the script with:
poetry run python url_checker.py
Testing with unittest
Python's built-in unittest
module allows us to write unit tests for our code. For our script, we could write tests to ensure that our check_url
function returns the correct output given a specific input. We could also write tests to ensure that our script correctly reads URLs from a file, logs results in an Excel file, and sleeps for the correct amount of time.
In our use case, writing unit tests would be a bit tricky due to the nature of the operations we are performing (network requests, file operations, and sleeping). However, we can still write some basic tests to verify our check_url
function.
Let's create a new file named test_url_checker.py in the project's root directory:
import unittest
from url_checker import check_url

class TestURLChecker(unittest.TestCase):
    def test_check_url(self):
        # Testing with a valid URL
        url, status_code, load_time, _ = check_url('https://www.google.com')
        self.assertEqual(url, 'https://www.google.com')
        self.assertEqual(status_code, 200)
        self.assertNotEqual(load_time, 'N/A')
        # Testing with an invalid URL
        url, status_code, load_time, _ = check_url('https://www.nonexistentwebsite123456.com')
        self.assertEqual(url, 'https://www.nonexistentwebsite123456.com')
        self.assertEqual(status_code, 'Invalid URL')
        self.assertEqual(load_time, 'N/A')

if __name__ == '__main__':
    unittest.main()
This script tests the check_url function with both a valid URL and an invalid URL. For the valid URL, we expect the status code to be 200 and the load time to not be 'N/A'. For the invalid URL, we expect the status code to be 'Invalid URL' and the load time to be 'N/A'.
You can run the tests with:
poetry run python test_url_checker.py
Remember to replace 'https://www.google.com' and 'https://www.nonexistentwebsite123456.com' with URLs of your choice.
Note: Testing with network requests can be unpredictable, as the status code may vary with network conditions, server availability, etc. For more reliable tests, you might want to consider using a library like responses to mock HTTP requests.
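The same isolation can also be sketched with the standard library's unittest.mock instead of responses, so no extra dependency is needed. The check_url logic is inlined here so the example stands alone:

```python
import datetime
import time
import unittest
from unittest import mock

import requests

def check_url(url):
    # same logic as in url_checker.py, inlined so this sketch is self-contained
    start_time = time.time()
    try:
        response = requests.get(url, timeout=10)
        load_time = time.time() - start_time
        status_code = response.status_code
    except Exception:
        status_code = "Invalid URL"
        load_time = "N/A"
    return url, status_code, load_time, datetime.datetime.now()

class TestCheckURLMocked(unittest.TestCase):
    @mock.patch("requests.get")
    def test_mocked_200(self, mock_get):
        # fake a 200 response; no real network traffic is made
        mock_get.return_value.status_code = 200
        _, status_code, load_time, _ = check_url("https://example.com")
        self.assertEqual(status_code, 200)
        self.assertNotEqual(load_time, "N/A")
```

Because requests.get is patched, the test passes regardless of network conditions and runs instantly.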
In conclusion, the combination of Python with multithreading, Poetry, and unittest provides a powerful toolkit for building a URL monitor. This application is a robust, efficient solution for checking the availability and response time of a list of websites and logging the results for analysis. Happy coding!