Find images on your site without an ALT tag

And put them in a CSV file, along with the page where each image was found.

So, I have been letting ChatGPT help me for over 3 days now.

Do not get me wrong! I like writing code, but for some code, small pieces of code, ChatGPT comes in really handy.

I used to use ScreamingFrog for checking the SEO for all kinds of sites. But my license expired, so I was going to renew it.

And then... the new price! £199!!! 199!!!!! That is £59 more than last time. A year ago it was £140.

A £59 increase!! WTF!!! That is over 40%. And yes, they had not raised the price for 8 years, but they could have raised it a bit every year, right?

I've talked to them, but they did not want to help out. So I started to write code to SEO scrape my sites with Python. Well... actually ChatGPT is writing the code for me.

In my last two articles you could read about finding 404 pages, and now I want to find images without an ALT tag.

Let's go! The assignment is:

Write a Python script that searches for all images on a domain without a filled alt tag. The output should be put in a CSV file named "images_without_alt_tags.csv", with the image URL in the first column and the URL of the page where the image is found in the second column. Ask me for the domain by using input, and do not forget that this script has to be recursive.

And off we go!

To achieve this, you can use the 'requests' and 'BeautifulSoup' libraries to fetch and parse the HTML content of the domain. You can then search for images with empty or missing 'alt' attributes and write the results to a CSV file.

Before you proceed, make sure you have installed the required libraries by running:

pip install requests beautifulsoup4

Here's the Python script for your requirement:

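import csv
from urllib.parse import urldefrag, urljoin

import requests
from bs4 import BeautifulSoup

def get_all_links_recursive(url, domain, visited=None):
    """Return the set of all URLs under `domain` that are reachable from `url`."""
    if visited is None:
        visited = set()

    if url in visited:
        return visited

    visited.add(url)

    try:
        page = requests.get(url, timeout=10)
    except requests.RequestException:
        # Skip pages that cannot be fetched instead of aborting the whole crawl.
        return visited

    soup = BeautifulSoup(page.content, "html.parser")

    for link in soup.find_all("a"):
        href = link.get("href")
        if href:
            # Resolve relative links against the current page and drop any
            # "#fragment", so the same page is not crawled more than once.
            abs_url, _ = urldefrag(urljoin(url, href))
            if abs_url.startswith(domain) and abs_url not in visited:
                visited = get_all_links_recursive(abs_url, domain, visited)

    return visited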

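def find_images_without_alt_tags(url):
    """Return (image URL, page URL) pairs for images with a missing or empty alt."""
    try:
        page = requests.get(url, timeout=10)
    except requests.RequestException:
        # A page that cannot be fetched contributes no images.
        return []
    soup = BeautifulSoup(page.content, "html.parser")

    images = []
    for img in soup.find_all("img"):
        alt = img.get("alt")
        src = img.get("src")
        if src and (alt is None or alt.strip() == ""):
            # Turn a relative src into a full URL so the CSV is usable on its own.
            images.append((urljoin(url, src), url))
    return images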

def main():
    # requests needs a full URL with a scheme, so ask for it explicitly.
    domain = input("Enter the domain URL (including https://): ").strip()

    all_links = get_all_links_recursive(domain, domain)

    images_without_alt_tags = []
    for link in all_links:
        images_without_alt_tags.extend(find_images_without_alt_tags(link))

    with open("images_without_alt_tags.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Image URL", "Page URL"])
        for img in images_without_alt_tags:
            writer.writerow(img)

    print("CSV file created: images_without_alt_tags.csv")

if __name__ == "__main__":
    main()

This script will:

  1. Prompt the user for the domain URL.

  2. Recursively fetch all the links within the domain using the get_all_links_recursive function.

  3. Iterate through all the links and find images with empty or missing 'alt' attributes.

  4. Write the image URLs and the corresponding page URLs to a CSV file named 'images_without_alt_tags.csv'.
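The rule in step 3 can be tried out without crawling anything. Below is a minimal sketch that applies the same "missing or empty alt" check to a hard-coded piece of HTML. It uses the standard library's html.parser instead of BeautifulSoup so it runs without installing anything; the class name MissingAltFinder and the sample HTML are made up for this illustration and are not part of the script above.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> whose alt is missing or blank."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        alt = attr_map.get("alt")
        # Same rule as the script: no alt at all, or only whitespace.
        if src and (alt is None or alt.strip() == ""):
            self.missing.append(src)


# Hypothetical sample page: one good image and three problem images.
sample_html = """
<img src="/logo.png" alt="Company logo">
<img src="/banner.jpg" alt="">
<img src="/photo.jpg">
<img src="/spacer.gif" alt="   ">
"""

finder = MissingAltFinder()
finder.feed(sample_html)
print(finder.missing)
```

Only /logo.png passes the check, because an alt made of nothing but spaces is treated the same as an empty one.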

Please note that this script assumes every link within the domain is either an absolute URL or a relative URL that urljoin can resolve. It also does not handle JavaScript-rendered content, so images that are inserted by JavaScript may be missed.
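The point about relative URLs is easier to see with a few calls to urljoin. This is a small sketch with made-up example.com addresses, showing that the result depends on which base URL a link is resolved against:

```python
from urllib.parse import urljoin

# Hypothetical URLs, only for illustration.
domain = "https://example.com"
page = "https://example.com/blog/"

# A root-relative link resolves the same way from either base.
about_from_domain = urljoin(domain, "/about")
about_from_page = urljoin(page, "/about")

# A page-relative link only lands in the right place when it is
# resolved against the page it was found on.
post_from_domain = urljoin(domain, "post-2.html")
post_from_page = urljoin(page, "post-2.html")

# An absolute link ignores the base entirely.
external = urljoin(page, "https://other.example/x.png")

for result in (about_from_domain, about_from_page,
               post_from_domain, post_from_page, external):
    print(result)
```

So a link like post-2.html found on a blog page points somewhere different depending on whether the domain or the page itself is used as the base.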

Did you find this article valuable?

Support Theo van der Sluijs by becoming a sponsor. Any amount is appreciated!