Get 404 and source pages with a Python script

And put them in a CSV file

So the other day I had to get all 404 pages from a website, together with the source pages that link to them, and I wanted to have them in a CSV file.

So I opened up Visual Studio Code and started thinking about what code I should put down... and there was Copilot! A great helper!!

So the assignment is:

Let's create a Python script that recursively checks a site for 404 pages. The output should be put in a CSV file, with the 404 URL in the first column and, in the second column, the URL of the page where the 404 URL is found as a link. There is no need to scan URLs that fall outside the domain.

And off we go!

To create a Python script that recursively checks a site for 404 pages, you can use the requests, BeautifulSoup, and csv libraries. Make sure you have the requests and BeautifulSoup libraries installed by running the following commands:

pip install requests
pip install beautifulsoup4

Here's a script that accomplishes the task:

import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urlparse, urljoin

def is_valid(url):
    # A URL is valid when it has both a scheme (http/https) and a domain.
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

def get_all_links(url):
    # Fetch a page and return all of its links as absolute URLs.
    try:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        return [urljoin(url, link.get("href")) for link in soup.find_all("a", href=True)]
    except requests.RequestException:
        # If the page cannot be fetched, there are no links to follow.
        return []

def crawl(url, domain, visited, csv_writer):
    # Skip pages that have already been checked, to avoid endless loops.
    if url in visited:
        return

    visited.add(url)
    print(f"Visiting: {url}")

    links = get_all_links(url)
    for link in links:
        # Only check links that stay within the same domain.
        if urlparse(link).netloc == domain:
            try:
                response = requests.head(link, allow_redirects=True, timeout=10)
            except requests.RequestException:
                continue
            if response.status_code == 404:
                # First column: the broken URL, second column: the page linking to it.
                csv_writer.writerow([link, url])
            else:
                crawl(link, domain, visited, csv_writer)

def main(start_url, output_file):
    domain = urlparse(start_url).netloc
    visited = set()

    with open(output_file, "w", newline="", encoding="utf-8") as csvfile:
        csv_writer = csv.writer(csvfile)
        csv_writer.writerow(["404 URL", "Source URL"])
        crawl(start_url, domain, visited, csv_writer)

if __name__ == "__main__":
    start_url = "https://example.com"  # Replace with the website you want to crawl
    output_file = "404_errors.csv"

    if is_valid(start_url):
        main(start_url, output_file)
    else:
        print("Invalid starting URL. Please provide a valid URL.")

Replace https://example.com with the website you want to crawl. This script will crawl the website and store any 404 URLs along with their source URL in a CSV file named 404_errors.csv.
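
If you don't want to edit the script every time, you could also pass the URL and the output file on the command line. This is a small sketch I added myself (not part of the Copilot answer), using Python's built-in argparse together with the is_valid and main functions from the script above:

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Find 404 pages and the pages that link to them.")
    parser.add_argument("start_url", help="the URL to start crawling from")
    parser.add_argument("-o", "--output", default="404_errors.csv", help="name of the CSV output file")
    args = parser.parse_args()

    # Reuses is_valid() and main() from the script above.
    if is_valid(args.start_url):
        main(args.start_url, args.output)
    else:
        print("Invalid starting URL. Please provide a valid URL.")

Then you can run it with something like python check_404.py https://example.com -o 404_errors.csv (the script name here is just an example).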

Please note that this script is relatively simple and might not cover all edge cases or handle very large websites efficiently. Be aware that web scraping may also violate some websites' terms of service, and you should always respect website owners' wishes and robots.txt files.

And it just worked!
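
One thing I added afterwards myself, because of the robots.txt remark above: Python ships with urllib.robotparser, so you can check whether a URL may be crawled at all. A minimal sketch (my own addition, not from Copilot) could look like this:

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def load_robots(start_url):
    # Fetch and parse the site's robots.txt once, before the crawl starts.
    robots = RobotFileParser()
    robots.set_url(urljoin(start_url, "/robots.txt"))
    robots.read()
    return robots

# Inside crawl() you could then skip disallowed links, for example:
#   if not robots.can_fetch("*", link):
#       continue

You would create the parser once in main() and pass it on to crawl(), so robots.txt is only downloaded once.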
