Get all site URLs with Python

And put URLs with non-ASCII characters, underscores, or uppercase characters in a separate CSV file

So, more Screaming Frog-style scraping in Python. Let's get all URLs from a site and, when a URL contains non-ASCII characters, underscores, or uppercase characters, put it in a separate CSV file.

I guess I'm up to the challenge, so the assignment is:

Write a Python script that recursively checks a site and gets all URLs. Put all the URLs in a CSV file named "all_urls.csv" and put the URLs that have non-ASCII characters, underscores, or uppercase characters in a CSV file named "wrong_urls.csv". There is no need to scan URLs that fall outside the domain. Ask for a site domain with input() when the script starts.

And off we go!

To achieve this, you can use the 'requests' and 'BeautifulSoup' libraries to fetch and parse the HTML content of the domain. You can then collect all the URLs within the domain and write them to the respective CSV files based on their characteristics.

Before you proceed, make sure you have installed the required libraries by running:

pip install requests beautifulsoup4
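
If you want to see the basic pattern in isolation first, here's a minimal sketch that fetches a single page and prints every href it finds. The URL https://example.com is just a placeholder; swap in any page you like:

import requests
from bs4 import BeautifulSoup

# Placeholder page; replace with any URL you want to inspect
page = requests.get("https://example.com", timeout=10)
soup = BeautifulSoup(page.content, "html.parser")

# Every <a> tag that actually has an href attribute
for link in soup.find_all("a"):
    href = link.get("href")
    if href:
        print(href)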

Here's the Python script for your requirement:

import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin
import re

def get_all_links_recursive(url, domain, visited=None):
    """Recursively crawl url and collect every URL that stays within domain."""
    if visited is None:
        visited = set()

    if url in visited:
        return visited

    visited.add(url)

    # Fetch the page; skip it if the request fails so that one broken
    # link does not stop the whole crawl.
    try:
        page = requests.get(url, timeout=10)
    except requests.RequestException:
        return visited

    soup = BeautifulSoup(page.content, "html.parser")

    # Resolve every <a href> against the domain and follow it when it
    # stays inside the domain and has not been visited yet.
    for link in soup.find_all("a"):
        href = link.get("href")
        if href:
            abs_url = urljoin(domain, href)
            if abs_url.startswith(domain) and abs_url not in visited:
                visited = get_all_links_recursive(abs_url, domain, visited)

    return visited

def is_wrong_url(url):
    # A URL is "wrong" when it contains a non-ASCII character, an underscore
    # or an uppercase letter anywhere.
    return bool(re.search(r"[^\x00-\x7F]|_|[A-Z]", url))

def main():
    domain = input("Enter the domain URL: ")

    all_links = get_all_links_recursive(domain, domain)

    all_urls = []
    wrong_urls = []

    # Every crawled URL goes into all_urls.csv; the ones with non-ASCII
    # characters, underscores or uppercase letters also go into wrong_urls.csv,
    # as the assignment asks for. Sorting gives a stable output order.
    for link in sorted(all_links):
        all_urls.append(link)
        if is_wrong_url(link):
            wrong_urls.append(link)

    with open("all_urls.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["URL"])
        for url in all_urls:
            writer.writerow([url])

    with open("wrong_urls.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Wrong URL"])
        for url in wrong_urls:
            writer.writerow([url])

    print("CSV files created: all_urls.csv, wrong_urls.csv")

if __name__ == "__main__":
    main()
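
To run it, save the script as, say, check_urls.py (the filename is just an example) and start it from a terminal:

python check_urls.py
Enter the domain URL: https://example.com

Enter the domain including the scheme (https://...), because both the startswith(domain) check and urljoin work on that full form. When the crawl finishes you'll find all_urls.csv and wrong_urls.csv next to the script.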

This script does the following:

  1. Prompts the user for the domain URL.

  2. Recursively fetches all the links within the domain using the get_all_links_recursive function.

  3. Flags a URL as wrong if it contains non-ASCII characters, underscores, or uppercase characters (see the small demo after this list).

  4. Writes every URL to 'all_urls.csv' and the flagged ones to 'wrong_urls.csv'.
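
To make that classification rule concrete, here's what the regular expression in is_wrong_url flags; the example URLs are made up:

import re

def is_wrong_url(url):
    # Non-ASCII character, underscore or uppercase letter anywhere in the URL
    return bool(re.search(r"[^\x00-\x7F]|_|[A-Z]", url))

print(is_wrong_url("https://example.com/contact"))   # False
print(is_wrong_url("https://example.com/my_page"))   # True: underscore
print(is_wrong_url("https://example.com/About"))     # True: uppercase letter
print(is_wrong_url("https://example.com/café"))      # True: non-ASCII character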

Please note that this script still assumes all URLs within the domain are either absolute URLs or relative URLs that can be resolved against the domain URL. Additionally, it does not handle JavaScript-rendered content, so URLs that are only injected by JavaScript will be missed.
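
For instance, this is how the urljoin call and the startswith check behave for a few typical href values, using the made-up domain https://example.com:

from urllib.parse import urljoin

domain = "https://example.com"

# Relative and absolute hrefs are all resolved against the domain,
# and anything that ends up outside it is skipped by the crawler.
for href in ["/about", "blog/post-1", "https://example.com/contact", "https://other.com/page"]:
    abs_url = urljoin(domain, href)
    print(abs_url, "-> crawled" if abs_url.startswith(domain) else "-> skipped")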
