So, I have been letting ChatGPT help me for over 3 days now.
Don't get me wrong! I like writing code, but for some code, small pieces of code, ChatGPT comes in really handy.
I used to use ScreamingFrog for checking the SEO of all kinds of sites. But my license expired, so I was going to renew it.
And then….. the new price! £199!!! That is £59 more than last time; a year ago it was £140.
A £59 increase!! WTF!!! That's over 40%. And yes, they did not raise the price for eight years, but they could have increased it a bit every year, right?
I talked to them, but they did not want to help out. So I started writing code to SEO-scrape my sites with Python. Well... actually, ChatGPT is writing the code for me.
In my last two articles you could read about finding 404 pages, and now I want to find images without an ALT tag.
Let's go! The assignment is:
Write a Python script that searches for all images on a domain without a filled alt tag. The output should be put in a CSV file named "images_without_alt_tags.csv", with the image URL in the first column and the URL where the image is found in the second column. Ask me for the domain by using input, and do not forget that this script has to be recursive.
And off we go!
To achieve this, you can use the 'requests' and 'BeautifulSoup' libraries to fetch and parse the HTML content of the domain. You can then search for images with empty or missing 'alt' attributes and write the results to a CSV file.
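To see what "empty or missing alt" means in practice, here is a minimal sketch with a made-up HTML snippet: an image counts when the alt attribute is absent, empty, or only whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet: only the last image has a usable alt text.
html = """
<img src="a.png">
<img src="b.png" alt="">
<img src="c.png" alt="  ">
<img src="d.png" alt="A proper description">
"""
soup = BeautifulSoup(html, "html.parser")
missing = [
    img["src"]
    for img in soup.find_all("img")
    # img.get("alt") is None when the attribute is absent;
    # (None or "") turns that into an empty string before stripping.
    if not (img.get("alt") or "").strip()
]
print(missing)  # → ['a.png', 'b.png', 'c.png']
```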
Before you proceed, make sure you have installed the required libraries by running:
pip install requests beautifulsoup4
Here's the Python script for your requirement:
import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin


def get_all_links_recursive(url, domain, visited=None):
    # Collect every internal link by following pages recursively.
    if visited is None:
        visited = set()
    if url in visited:
        return visited
    visited.add(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    for link in soup.find_all("a"):
        href = link.get("href")
        if href:
            abs_url = urljoin(domain, href)
            if abs_url.startswith(domain) and abs_url not in visited:
                visited = get_all_links_recursive(abs_url, domain, visited)
    return visited


def find_images_without_alt_tags(url):
    # Return (image URL, page URL) pairs for images with an empty or missing alt.
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    images = []
    for img in soup.find_all("img"):
        alt = img.get("alt")
        src = img.get("src")
        if src and (alt is None or alt.strip() == ""):
            images.append((src, url))
    return images


def main():
    domain = input("Enter the domain URL: ")
    all_links = get_all_links_recursive(domain, domain)
    images_without_alt_tags = []
    for link in all_links:
        images_without_alt_tags.extend(find_images_without_alt_tags(link))
    with open("images_without_alt_tags.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Image URL", "Page URL"])
        for img in images_without_alt_tags:
            writer.writerow(img)
    print("CSV file created: images_without_alt_tags.csv")


if __name__ == "__main__":
    main()
This script will:
Prompt the user for the domain URL.
Recursively fetch all the links within the domain using the get_all_links_recursive function.
Iterate through all the links and find images with empty or missing 'alt' attributes.
Write the image URLs and the corresponding page URLs to a CSV file named 'images_without_alt_tags.csv'.
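One caveat when reading the CSV: the script writes each src value exactly as it appears in the HTML, so relative paths stay relative. If you wanted absolute image URLs, urljoin (which the script already uses for links) would resolve them against the page URL. A small sketch, with example.com as a placeholder:

```python
from urllib.parse import urljoin

page_url = "https://example.com/blog/post/"  # hypothetical page URL

# A root-relative src resolves against the domain root:
print(urljoin(page_url, "/images/logo.png"))  # → https://example.com/images/logo.png

# A plain relative src resolves against the page's directory:
print(urljoin(page_url, "photo.jpg"))  # → https://example.com/blog/post/photo.jpg
```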