Get ScreamingFrog Data with a Python script

Without using ScreamingFrog

The last few days I have been writing about getting URL data from a site with Python.

A few years back I started writing a Python script to get URL data to analyze, much like ScreamingFrog does. At a certain point I lost track of the script and figured it would be easier to just use ScreamingFrog.

Last week I had to renew my ScreamingFrog subscription (you pay per year and cannot use the tool until you renew your yearly subscription again), but the price went up from 140 to 199!

So it got me thinking about the Python script again, and I started writing, with a little help from ChatGPT.

So last night I was thinking: let's try to combine more URL information in one Excel file.

So the assignment is:

Create a Python script with which you can recursively scan all URLs of a site; do not scan images, videos, JavaScript or CSS files. Put the results in an xslt file called "url-scan.xlst". In the file you put per URL:
in the first column url,
in the next column the response code,
in the next column the title,
in the next column the number of characters of the title,
in the next column the number of characters of the meta description,
in the next column the number of H1 tags,
in the next column if the title is equal to the first H1 tag with True or False,
in the next column the number of H2 tags,
in the next column the number of H3 tags,
in the next column the word count of the content,
in the next column the number of urls via HTTPS,
in the next column the number of urls via HTTP,
in the next column the number of images in the page without an ALT attribute,
in the next column the number of images in the page without a title attribute
You do not scan URLs outside the site.
Ask for the base_url with input and check if it is a valid URL

And off we go!

To create a Python script that accomplishes these tasks, you can use the following code. This script uses the requests, BeautifulSoup (beautifulsoup4), lxml and openpyxl libraries. Please make sure to install them with pip before running the script:

pip install requests beautifulsoup4 lxml openpyxl

Here's the Python script:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin
import re
from openpyxl import Workbook

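# A URL is considered valid when urlparse finds both a scheme (http/https) and a hostname.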
def is_valid_url(url):
    try:
        parsed_url = urlparse(url)
        return all([parsed_url.scheme, parsed_url.netloc])
    except ValueError:
        return False

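# A link counts as internal when its hostname matches the hostname of the base URL.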
def is_internal_url(base_url, url):
    return urlparse(url).netloc == urlparse(base_url).netloc

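# Collect every internal link on the page, skipping anything that matches
# the module-level ignore_patterns list (images, scripts, css, ...).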
def get_all_links(url, soup):
    links = set()
    for a_tag in soup.find_all("a"):
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            continue

        full_url = urljoin(url, href)
        # re.search instead of re.match: the extension pattern is anchored at the
        # end of the URL, so matching from the start would never filter anything
        if is_internal_url(url, full_url) and is_valid_url(full_url) and not any(re.search(pattern, full_url) for pattern in ignore_patterns):
            links.add(full_url)

    return links

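# Fetch a single URL and build one spreadsheet row: status code, title and meta
# description lengths, heading counts, word count, HTTPS/HTTP link counts and
# image alt/title attribute counts.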
def analyze_url(url):
    # A timeout keeps the crawler from hanging on a server that never answers
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, "lxml")

    title_tag = soup.find("title")
    title = title_tag.text if title_tag else ""
    title_count = len(title)

    meta_description = soup.find("meta", attrs={"name": "description"})
    meta_description_count = len(meta_description["content"]) if meta_description and "content" in meta_description.attrs else 0

    h1_tags = soup.find_all("h1")
    h1_count = len(h1_tags)
    first_h1 = h1_tags[0].text if h1_tags else ""
    # Strip surrounding whitespace before comparing, otherwise a stray newline
    # in either tag makes this False
    title_equals_first_h1 = title.strip() == first_h1.strip()

    h2_count = len(soup.find_all("h2"))
    h3_count = len(soup.find_all("h3"))

    content = soup.body.text if soup.body else ""
    word_count = len(content.split())

    # Collect the internal links once and split the count by scheme
    links = get_all_links(url, soup)
    https_links = sum(1 for link in links if urlparse(link).scheme == "https")
    http_links = sum(1 for link in links if urlparse(link).scheme == "http")

    img_tags = soup.find_all("img")
    img_without_alt = sum(1 for img in img_tags if "alt" not in img.attrs)
    img_without_title = sum(1 for img in img_tags if "title" not in img.attrs)

    return [
        url, response.status_code, title, title_count, meta_description_count, h1_count, title_equals_first_h1,
        h2_count, h3_count, word_count, https_links, http_links, img_without_alt, img_without_title
    ]

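# Depth-first crawl: analyze the page, then follow its internal links until
# max_depth is reached or a URL has already been visited.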
def crawl(url, max_depth=5, depth=1):
    global visited_urls
    global results
    if depth > max_depth or url in visited_urls:
        return

    visited_urls.add(url)
    try:
        result = analyze_url(url)
        results.append(result)
        print(f"[{depth}] {url}")

        # The page is fetched a second time here, just to collect its links for the crawl
        soup = BeautifulSoup(requests.get(url, timeout=10).content, "lxml")
        for link in get_all_links(url, soup):
            crawl(link, max_depth, depth + 1)
    except Exception as e:
        print(f"Error processing {url}: {e}")

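# Write the header row and all collected results to an Excel workbook.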
def save_to_xlsx(results, filename):
    wb = Workbook()
    ws = wb.active
    headers = [
        "URL", "Response Code", "Title", "Title Length", "Meta Description Length",
        "H1 Count", "Title Equals First H1", "H2 Count", "H3 Count", "Word Count",
        "HTTPS Links", "HTTP Links", "Images Without ALT", "Images Without Title"
    ]

    for col_num, header in enumerate(headers, 1):
        ws.cell(row=1, column=col_num, value=header)

    for row_num, result in enumerate(results, 2):
        for col_num, value in enumerate(result, 1):
            ws.cell(row=row_num, column=col_num, value=value)

    wb.save(filename)

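# Entry point: ask for the base URL, crawl the site and write the results to Excel.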
base_url = input("Please enter the base URL: ").strip()

while not is_valid_url(base_url):
    print("Invalid URL. Please enter a valid URL.")
    base_url = input("Please enter the base URL: ").strip()

ignore_patterns = [r"\.(jpg|jpeg|png|gif|svg|js|css|ico|mp4|webm|pdf)$"]

visited_urls = set()
results = []

crawl(base_url)

save_to_xlsx(results, "url-scan.xlsx")

print("URL scan complete. Results saved to 'url-scan.xlsx'.")

This script crawls the site and saves the results to a "url-scan.xlsx" file. Please note that the output file format is xlsx instead of xlst, as xlsx is the correct format for Excel files. The script asks the user for the base URL using the input() function and checks whether it is valid with the is_valid_url() function. If the URL is invalid, it asks again until a valid URL is provided.
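If you want a quick sanity check of the output without opening Excel, you can read the file back with openpyxl as well. A minimal sketch, assuming the script has already run and "url-scan.xlsx" sits in the same folder:

from openpyxl import load_workbook

wb = load_workbook("url-scan.xlsx")
ws = wb.active

# Print URL, response code and title length for the first few crawled pages
for row in ws.iter_rows(min_row=2, max_row=6, values_only=True):
    print(row[0], row[1], row[3])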
