Get all site urls, without images, javascript and css files

Put them in a csv file

So the other day I created... (with a bit of help) a script to get all site URLs and put them in a CSV file. All URLs containing non-ASCII characters, underscores, or capitals were put in a separate file.

But... I also got images, JavaScript files, and CSS files. I just wanted page URLs.

So the assignment is:

Write a python script that recursively checks a site and gives me all the URLs, put all the URLs in a CSV file called "all_urls.csv" and put the URLs with non-ASCII characters or underscores or capitals in a CSV file named "wrong_urls.csv". No need to scan URLs that are outside the domain. Ask for a site domain with input when the script starts. Skip images, javascript files, and stylesheet files.

Now... before I hit Enter, it came to me: I should add some comments and tests.

So the assignment became:

Write a python script that recursively checks a site and gives me all the URLs, put all the URLs in a CSV file called "all_urls.csv" and put the URLs with non-ASCII characters or underscores or capitals in a CSV file named "wrong_urls.csv". No need to scan URLs that are outside the domain. Ask for a site domain with input when the script starts. Skip images, javascript files, and stylesheet files. Write some inline comments and a test script.

And off we go!

To achieve this, you can use the 'requests' and 'BeautifulSoup' libraries to fetch and parse the HTML content of the domain. You can then collect all the URLs within the domain, excluding images, JavaScript files, and stylesheets, and write them to the respective CSV files based on their characteristics.

Before you proceed, make sure you have installed the required libraries by running:

pip install requests beautifulsoup4

Here's the Python script for your requirement:

import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin
import re

# Recursively fetches all links within the domain
def get_all_links_recursive(url, domain, visited=None):
    if visited is None:
        visited = set()

    if url in visited:
        return visited

    visited.add(url)

    try:
        page = requests.get(url, timeout=10)
        page.raise_for_status()
    except requests.RequestException:
        # Skip pages that fail to load instead of crashing the whole crawl
        return visited

    soup = BeautifulSoup(page.content, "html.parser")

    for link in soup.find_all("a"):
        href = link.get("href")
        if href and not is_skipped_file(href):
            abs_url = urljoin(domain, href)
            if abs_url.startswith(domain) and abs_url not in visited:
                visited = get_all_links_recursive(abs_url, domain, visited)

    return visited

# Checks if the URL contains non-ASCII characters, underscores, or uppercase characters
def is_wrong_url(url):
    return bool(re.search(r"[^\x00-\x7F]|_|[A-Z]", url))

# Checks if the URL corresponds to an image, JavaScript file, or stylesheet
def is_skipped_file(url):
    ext = url.split(".")[-1].lower()
    return ext in ("jpg", "jpeg", "png", "gif", "svg", "js", "css")

def main():
    # Prompt user for domain URL
    domain = input("Enter the domain URL (including https://): ")

    # Recursively fetch all the links within the domain
    all_links = get_all_links_recursive(domain, domain)

    all_urls = []
    wrong_urls = []

    # Classify URLs
    for link in all_links:
        if is_wrong_url(link):
            wrong_urls.append(link)
        else:
            all_urls.append(link)

    # Write URLs to 'all_urls.csv'
    with open("all_urls.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["URL"])
        for url in all_urls:
            writer.writerow([url])

    # Write wrong URLs to 'wrong_urls.csv'
    with open("wrong_urls.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Wrong URL"])
        for url in wrong_urls:
            writer.writerow([url])

    print("CSV files created: all_urls.csv, wrong_urls.csv")

if __name__ == "__main__":
    main()
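One caveat worth noting (my observation, not something from the original script): is_skipped_file looks at everything after the last dot, so a versioned stylesheet like styles.css?v=2 produces the "extension" css?v=2 and slips through the filter. A sketch of a more robust variant that parses out the path first:

```python
from urllib.parse import urlparse

SKIP_EXTS = ("jpg", "jpeg", "png", "gif", "svg", "js", "css")

# Variant of is_skipped_file that ignores query strings and fragments,
# so "styles.css?v=2" is still recognized as a stylesheet.
def is_skipped_file(url):
    path = urlparse(url).path          # drops "?v=2" and "#fragment"
    ext = path.rsplit(".", 1)[-1].lower()
    return ext in SKIP_EXTS
```

With this version, is_skipped_file("https://example.com/styles.css?v=2") is True, while plain pages like https://example.com/page are still crawled.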

Creating tests for this script is a bit challenging since it involves web requests and the content of the web pages may change over time. However, you can create some basic tests for the helper functions is_wrong_url and is_skipped_file.

First, split the helper functions is_wrong_url and is_skipped_file from the main script into a separate Python file named script_helpers.py. Then, import these functions in the main script using:

from script_helpers import is_wrong_url, is_skipped_file
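For reference, script_helpers.py would then hold just the two helpers, moved over from the main script:

```python
# script_helpers.py - the two helper functions moved out of the main script
import re

# Checks if the URL contains non-ASCII characters, underscores, or uppercase characters
def is_wrong_url(url):
    return bool(re.search(r"[^\x00-\x7F]|_|[A-Z]", url))

# Checks if the URL corresponds to an image, JavaScript file, or stylesheet
def is_skipped_file(url):
    ext = url.split(".")[-1].lower()
    return ext in ("jpg", "jpeg", "png", "gif", "svg", "js", "css")
```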

Next, create a new Python file named test_script.py and include the following test script using the unittest library:

import unittest
from script_helpers import is_wrong_url, is_skipped_file

class TestScript(unittest.TestCase):

    def test_is_wrong_url(self):
        self.assertTrue(is_wrong_url("https://example.com/Test_Url"))
        self.assertTrue(is_wrong_url("https://example.com/テスト"))
        self.assertFalse(is_wrong_url("https://example.com/test-url"))

    def test_is_skipped_file(self):
        self.assertTrue(is_skipped_file("https://example.com/image.jpg"))
        self.assertTrue(is_skipped_file("https://example.com/script.js"))
        self.assertTrue(is_skipped_file("https://example.com/styles.css"))
        self.assertFalse(is_skipped_file("https://example.com/test-url"))

if __name__ == "__main__":
    unittest.main()

To run the tests, execute the test_script.py file in the command line:

python test_script.py

These tests will verify the functionality of the is_wrong_url and is_skipped_file functions by checking if they correctly identify wrong URLs and skipped files, respectively.

Please note that these tests only cover a small part of the script's functionality. Testing the get_all_links_recursive function is more complex due to its dependency on web requests and the dynamic nature of web page content. In practice, you might use a library like httpretty or responses to mock web requests and create tests for this function.
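If you'd rather avoid extra dependencies, one stdlib-only alternative (my own sketch, not part of the generated script) is to refactor the crawler so the fetch step is injected as a function; tests can then feed it canned HTML instead of making real HTTP requests:

```python
import re
from urllib.parse import urljoin

# Hypothetical refactor (not the article's code): the crawler takes a fetch
# callable, so tests can supply canned HTML instead of real HTTP responses.
def get_all_links_recursive(url, domain, fetch, visited=None):
    if visited is None:
        visited = set()
    if url in visited:
        return visited
    visited.add(url)
    # Tiny href extractor for the sketch; the real script uses BeautifulSoup.
    for href in re.findall(r'href="([^"]+)"', fetch(url)):
        abs_url = urljoin(domain, href)
        if abs_url.startswith(domain) and abs_url not in visited:
            visited = get_all_links_recursive(abs_url, domain, fetch, visited)
    return visited

# Canned two-page "site" standing in for real web content.
PAGES = {
    "https://example.com": '<a href="/about">About</a>',
    "https://example.com/about": '<a href="https://other.org/x">External</a>',
}

def fake_fetch(url):
    return PAGES.get(url, "")

links = get_all_links_recursive("https://example.com", "https://example.com", fake_fetch)
# links now contains both in-domain pages and nothing from other.org
```

The same injection idea works with the real requests-based fetcher in production, while tests stay fast and deterministic.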

Support Theo van der Sluijs by becoming a sponsor. Any amount is appreciated!