Find articles without a meta description with Python

and improve your site's SEO

So I have been writing a lot of blog posts on Hashnode. I really love the environment, and it's cheaper, faster, and better than a good WordPress site.

But I sometimes forget to write a good SEO description. Most of the time I forget it completely.

But checking every article by hand is a bit too much work. And I love to automate shit, right?

So this is the assignment:

Write a Python script that takes a website URL as input, recursively scans all its content pages, and saves the URLs with missing meta descriptions or meta descriptions shorter than 100 characters in an Excel file. Use Poetry as a tool for dependency management and packaging, and write tests with the unittest library.
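
Before diving in, this is roughly what the finished project will look like once all the steps below are done (the two website_crawler folders are intentional: the outer one is the project, the inner one is the Python package):

website_crawler/
    pyproject.toml
    website_crawler/
        __init__.py
        crawler.py
        main.py
    tests/
        test_crawler.py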

To get started, first install Poetry using the instructions provided on the official website: https://python-poetry.org/docs/#installation

Create a new project folder and navigate to it in the terminal:

mkdir website_crawler
cd website_crawler

Initialize the project with Poetry:

poetry init

Add the required dependencies (requests, beautifulsoup4, openpyxl) using Poetry:

poetry add requests beautifulsoup4 openpyxl
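
After this, the dependencies section of pyproject.toml should look roughly like the snippet below. This is just a sketch in the Poetry 1.x layout; the exact version constraints (and, with newer Poetry versions, the section names) will differ on your machine:

# pyproject.toml (excerpt, version numbers are only examples)
[tool.poetry.dependencies]
python = "^3.11"
requests = "^2.31"
beautifulsoup4 = "^4.12"
openpyxl = "^3.1"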

Inside the project folder, create a second website_crawler folder; this is the Python package that will contain the actual code:

mkdir website_crawler

Inside the website_crawler folder, create the __init__.py file to make it a Python package:

touch website_crawler/__init__.py

Create the crawler.py file inside the website_crawler folder and add the following code:

# website_crawler/crawler.py

import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook
from urllib.parse import urljoin, urlparse


def extract_links(url, soup):
    """Extract internal links on the given page and return them as a list of absolute URLs."""
    links = [a.get("href") for a in soup.find_all("a", href=True)]
    absolute_links = [
        urljoin(url, link) for link in links
        if urlparse(urljoin(url, link)).netloc == urlparse(url).netloc
    ]
    return absolute_links


def check_meta_description(soup):
    """Return True if the meta description is missing or shorter than 100 characters."""
    meta_description = soup.find("meta", attrs={"name": "description"})
    if meta_description is None:
        return True
    return len(meta_description.get("content", "")) < 100


def crawl_website(url):
    """Crawl the website recursively and return a list of URLs with missing or short meta descriptions."""
    visited = set()
    missing_meta_description = []

    def visit_page(url):
        if url not in visited:
            visited.add(url)
            # Fetch and parse the page once, then reuse the soup for both checks.
            soup = BeautifulSoup(requests.get(url).text, "html.parser")
            if check_meta_description(soup):
                missing_meta_description.append(url)
            for link in extract_links(url, soup):
                visit_page(link)

    visit_page(url)
    return missing_meta_description


def save_to_excel(urls, output_file):
    """Save the list of URLs to an Excel file."""
    wb = Workbook()
    ws = wb.active
    ws.append(["URL"])
    for url in urls:
        ws.append([url])
    wb.save(output_file)
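
Before wiring this into a command-line script, you can already try the crawler from the project root. A minimal sketch, assuming a throwaway file called try_crawler.py (the URL and output file name are placeholders):

# try_crawler.py -- run with: poetry run python try_crawler.py
from website_crawler.crawler import crawl_website, save_to_excel

pages = crawl_website("https://example.com")  # placeholder URL
save_to_excel(pages, "missing_meta.xlsx")     # placeholder file name
print(f"Found {len(pages)} pages with a missing or short meta description")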

Now, let's create the main.py file inside the website_crawler folder:

# website_crawler/main.py

import sys
from .crawler import crawl_website, save_to_excel

def main():
    if len(sys.argv) < 3:
        print("Usage: python -m website_crawler WEBSITE_URL OUTPUT_FILE")
        sys.exit(1)

    website_url = sys.argv[1]
    output_file = sys.argv[2]

    urls_missing_meta_description = crawl_website(website_url)
    save_to_excel(urls_missing_meta_description, output_file)

    print(f"URLs with missing or short meta descriptions saved in {output_file}")

if __name__ == "__main__":
    main()

Now, let's create a tests folder and a test_crawler.py file inside it:

mkdir tests
touch tests/test_crawler.py

Add the following test code to tests/test_crawler.py:

# tests/test_crawler.py

import unittest

from bs4 import BeautifulSoup

from website_crawler.crawler import extract_links, check_meta_description

class TestCrawler(unittest.TestCase):

    def test_extract_links(self):
        html = """
        <html>
            <body>
                <a href="/page1">Page 1</a>
                <a href="/page2">Page 2</a>
                <a href="https://external.com/page">External</a>
            </body>
        </html>
        """
        soup = BeautifulSoup(html, "html.parser")
        links = extract_links("https://example.com", soup)
        expected_links = ["https://example.com/page1", "https://example.com/page2"]
        self.assertEqual(links, expected_links)

    def test_check_meta_description(self):
        html_missing = """
        <html>
            <head></head>
            <body></body>
        </html>
        """
        html_short = """
        <html>
            <head>
                <meta name="description" content="Short description">
            </head>
            <body></body>
        </html>
        """
        html_long = """
        <html>
            <head>
                <meta name="description" content="This is a long description that is more than 100 characters long, so it should not be considered as short.">
            </head>
            <body></body>
        </html>
        """
        self.assertTrue(check_meta_description(BeautifulSoup(html_missing, "html.parser")))
        self.assertTrue(check_meta_description(BeautifulSoup(html_short, "html.parser")))
        self.assertFalse(check_meta_description(BeautifulSoup(html_long, "html.parser")))

if __name__ == "__main__":
    unittest.main()
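
The tests above run entirely offline, but crawl_website does make real HTTP requests. If you also want to cover it without touching the network, a sketch using unittest.mock could look like this (the file name, URL, and page content are made up for the example):

# tests/test_crawl_website.py (sketch)
import unittest
from unittest.mock import MagicMock, patch

from website_crawler.crawler import crawl_website


class TestCrawlWebsite(unittest.TestCase):

    @patch("website_crawler.crawler.requests.get")
    def test_single_page_without_meta_description(self, mock_get):
        # One page, no meta description, no internal links to follow.
        mock_get.return_value = MagicMock(
            text="<html><head></head><body>No meta here</body></html>"
        )
        result = crawl_website("https://example.com")
        self.assertEqual(result, ["https://example.com"])


if __name__ == "__main__":
    unittest.main()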

Finally, to run the tests, execute the following command in the terminal:

poetry run python -m unittest discover tests

To run the main script, run the main module inside the package:

poetry run python -m website_crawler.main WEBSITE_URL OUTPUT_FILE
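
For example, assuming you want to scan https://example.com (a placeholder here) and save the results to missing_meta.xlsx:

poetry run python -m website_crawler.main https://example.com missing_meta.xlsx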

In summary, the script consists of the following functions:

  • extract_links(url, soup): Extracts all internal links from a given page and returns them as a list of absolute URLs.

  • check_meta_description(soup): Checks whether a parsed page is missing a meta description or has one shorter than 100 characters.

  • crawl_website(url): Recursively crawls a website and returns a list of URLs with missing or short meta descriptions.

  • save_to_excel(urls, output_file): Saves the list of URLs to an Excel file.

The test suite in tests/test_crawler.py contains tests for extract_links and check_meta_description functions.
