Checking unique 404 pages with Python

And putting them in a CSV file

So yesterday I talked about getting 404 pages and the linking source pages with Python; today I want to talk about unique 404 pages.

As Google dictates that you keep a clean site without internal links to 404 pages, yesterday's script comes in handy when you want to find all internal pages that link to 404 pages.

But you might also just want to clean up the 404 pages and redirect them to other pages.

So the assignment is:

Write a Python script that recursively checks a site for unique 404 pages; the output should be written to a CSV file. There is no need to scan URLs that fall outside the domain.

And off we go!

To create a Python script that recursively checks a website for unique 404 pages and stores the results in a CSV file, you can use the BeautifulSoup and requests libraries to fetch the pages and extract the links. Then, use a recursive function to traverse the site's links, checking for 404 errors. Finally, use Python's built-in csv module to save the results.
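
As a quick illustration of the link-extraction part: given some HTML (the snippet below is just made-up sample markup), BeautifulSoup lets you pull out all the href values in one line.

from bs4 import BeautifulSoup

html = '<a href="/about">About</a> <a href="/contact">Contact</a>'  # sample markup
soup = BeautifulSoup(html, "html.parser")
hrefs = [a.get("href") for a in soup.find_all("a")]
print(hrefs)  # ['/about', '/contact']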

First, make sure you have installed the necessary libraries:

pip install beautifulsoup4
pip install requests

Here's the Python script:

import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urlparse, urljoin

def is_valid_url(url):
    # A URL is only worth checking if it has a scheme (http/https) and a domain
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

def get_links(url):
    # Fetch the page and return every href found, skipping anchors without one
    try:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.content, "html.parser")
        links = [link.get("href") for link in soup.find_all("a") if link.get("href")]
        return links
    except requests.RequestException:
        return []

def find_404_errors(url, base_url, visited, errors):
    # Skip invalid URLs and anything outside the starting domain
    if not is_valid_url(url) or urlparse(url).netloc != base_url.netloc:
        return

    # Don't check the same URL twice
    if url in visited:
        return

    visited.add(url)

    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return

    if response.status_code == 404:
        errors.add(url)
        print(f"Found 404 error: {url}")

    # Only follow links on pages that loaded successfully
    if response.status_code != 200:
        return

    links = get_links(url)
    for link in links:
        # Resolve relative links against the current page, not the site root
        absolute_link = urljoin(url, link)
        find_404_errors(absolute_link, base_url, visited, errors)

def main(start_url):
    base_url = urlparse(start_url)
    visited = set()
    errors = set()

    find_404_errors(start_url, base_url, visited, errors)

    with open("404_unique_errors.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["URL"])
        for error in errors:
            writer.writerow([error])

if __name__ == "__main__":
    start_url = input("Enter the starting URL: ")
    main(start_url)

This script takes a starting URL and traverses the links within the domain, looking for 404 errors. The unique 404 errors are saved in a CSV file called "404_unique_errors.csv". Note that the script may take a long time to run, depending on the size of the website.
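
One caveat: Python limits recursion depth (1000 frames by default), so on a very large or deeply nested site the recursive approach can raise a RecursionError. Below is a minimal sketch of an iterative variant that keeps the same behaviour, reusing is_valid_url and get_links from the script above; the function name crawl_404s is my own, not part of the original script.

from collections import deque
from urllib.parse import urlparse, urljoin
import requests

def crawl_404s(start_url):
    # Same logic as find_404_errors, but with an explicit queue instead of recursion
    base_url = urlparse(start_url)
    visited = set()
    errors = set()
    queue = deque([start_url])

    while queue:
        url = queue.popleft()
        if url in visited or not is_valid_url(url) or urlparse(url).netloc != base_url.netloc:
            continue
        visited.add(url)

        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue

        if response.status_code == 404:
            errors.add(url)
            print(f"Found 404 error: {url}")
        if response.status_code != 200:
            continue

        # Queue every link on the page, resolved against the current page
        for link in get_links(url):
            queue.append(urljoin(url, link))

    return errors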

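Once the crawl has finished you can read the results back in for the next step, for example to prepare redirects for the broken URLs. A minimal sketch, assuming the CSV was written by the script above:

import csv

with open("404_unique_errors.csv", newline="") as csvfile:
    reader = csv.reader(csvfile)
    next(reader)  # skip the "URL" header row
    broken_urls = [row[0] for row in reader]

print(f"Found {len(broken_urls)} unique 404 URLs to redirect or clean up")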