Photo by Nick Morrison on Unsplash
Get all site URLs, without images, JavaScript and CSS files
Put them in a CSV file
So the other day I created.... (with a bit of help) a script to get all site URLs and put them in a CSV file. All URLs containing non-ASCII characters, underscores or capitals went into a separate file.
But... I also got images, JavaScript files and CSS files. I just wanted URLs.
So the assignment is:
Write a Python script that recursively checks a site and gives me all the URLs. Put all the URLs in a CSV file called "all_urls.csv" and put the URLs with non-ASCII characters, underscores or capitals in a CSV file named "wrong_urls.csv". No need to scan URLs that are outside the domain. Ask for a site domain with input when the script starts. Skip images, JavaScript files, and stylesheet files.
Now... before I hit enter, it came to me: I have to do some commenting and testing.
So the assignment became:
Write a Python script that recursively checks a site and gives me all the URLs. Put all the URLs in a CSV file called "all_urls.csv" and put the URLs with non-ASCII characters, underscores or capitals in a CSV file named "wrong_urls.csv". No need to scan URLs that are outside the domain. Ask for a site domain with input when the script starts. Skip images, JavaScript files, and stylesheet files. Write some inline comments and a test script.
And off we go!
To achieve this, you can use the 'requests' and 'BeautifulSoup' libraries to fetch and parse the HTML content of the domain. You can then collect all the URLs within the domain, excluding images, JavaScript files, and stylesheets, and write them to the respective CSV files based on their characteristics.
Before you proceed, make sure you have installed the required libraries by running:
pip install requests beautifulsoup4
Here's the Python script for your requirement:
import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin
import re

# Recursively fetches all links within the domain
def get_all_links_recursive(url, domain, visited=None):
    if visited is None:
        visited = set()
    if url in visited:
        return visited
    visited.add(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    for link in soup.find_all("a"):
        href = link.get("href")
        if href and not is_skipped_file(href):
            abs_url = urljoin(domain, href)
            if abs_url.startswith(domain) and abs_url not in visited:
                visited = get_all_links_recursive(abs_url, domain, visited)
    return visited

# Checks if the URL contains non-ASCII characters, underscores, or uppercase characters
def is_wrong_url(url):
    if re.search(r"[^\x00-\x7F]|_|[A-Z]", url):
        return True
    return False

# Checks if the URL corresponds to an image, JavaScript file, or stylesheet
def is_skipped_file(url):
    ext = url.split(".")[-1].lower()
    return ext in ("jpg", "jpeg", "png", "gif", "svg", "js", "css")

def main():
    # Prompt user for domain URL
    domain = input("Enter the domain URL: ")

    # Recursively fetch all the links within the domain
    all_links = get_all_links_recursive(domain, domain)

    all_urls = []
    wrong_urls = []

    # Classify URLs
    for link in all_links:
        if is_wrong_url(link):
            wrong_urls.append(link)
        else:
            all_urls.append(link)

    # Write URLs to 'all_urls.csv'
    with open("all_urls.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["URL"])
        for url in all_urls:
            writer.writerow([url])

    # Write wrong URLs to 'wrong_urls.csv'
    with open("wrong_urls.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Wrong URL"])
        for url in wrong_urls:
            writer.writerow([url])

    print("CSV files created: all_urls.csv, wrong_urls.csv")

if __name__ == "__main__":
    main()
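To run the script, save it as, say, get_urls.py (any filename will do) and start it from the command line:
python get_urls.py
It will prompt for the domain (for example https://example.com) and write all_urls.csv and wrong_urls.csv to the current directory.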
Creating tests for this script is a bit challenging since it involves web requests, and the content of the web pages may change over time. However, you can create some basic tests for the helper functions is_wrong_url and is_skipped_file.
First, split the helper functions is_wrong_url and is_skipped_file from the main script into a separate Python file named script_helpers.py. Then, import these functions in the main script using:
from script_helpers import is_wrong_url, is_skipped_file
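For reference, script_helpers.py then contains nothing more than the two helpers and the re import they need (the bodies are copied verbatim from the script above):

# script_helpers.py
import re

# Checks if the URL contains non-ASCII characters, underscores, or uppercase characters
def is_wrong_url(url):
    if re.search(r"[^\x00-\x7F]|_|[A-Z]", url):
        return True
    return False

# Checks if the URL corresponds to an image, JavaScript file, or stylesheet
def is_skipped_file(url):
    ext = url.split(".")[-1].lower()
    return ext in ("jpg", "jpeg", "png", "gif", "svg", "js", "css")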
Next, create a new Python file named test_script.py and include the following test script using the unittest library:
import unittest
from script_helpers import is_wrong_url, is_skipped_file

class TestScript(unittest.TestCase):
    def test_is_wrong_url(self):
        self.assertTrue(is_wrong_url("https://example.com/Test_Url"))
        self.assertTrue(is_wrong_url("https://example.com/テスト"))
        self.assertFalse(is_wrong_url("https://example.com/test-url"))

    def test_is_skipped_file(self):
        self.assertTrue(is_skipped_file("https://example.com/image.jpg"))
        self.assertTrue(is_skipped_file("https://example.com/script.js"))
        self.assertTrue(is_skipped_file("https://example.com/styles.css"))
        self.assertFalse(is_skipped_file("https://example.com/test-url"))

if __name__ == "__main__":
    unittest.main()
To run the tests, execute the test_script.py file from the command line:
python test_script.py
These tests will verify the functionality of the is_wrong_url and is_skipped_file functions by checking whether they correctly identify wrong URLs and skipped files, respectively.
Please note that these tests only cover a small part of the script's functionality. Testing the get_all_links_recursive function is more complex due to its dependency on web requests and the dynamic nature of web page content. In practice, you might use a library like httpretty or responses to mock web requests and create tests for this function.
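As an illustration, here is a minimal sketch of that idea using the responses library (pip install responses). It assumes the main script was saved as crawler.py (substitute whatever filename you used), and the two fake pages and their links exist only for this test:

import unittest
import responses
# crawler.py is a placeholder name for whatever file holds the main script
from crawler import get_all_links_recursive

class TestCrawler(unittest.TestCase):
    @responses.activate
    def test_get_all_links_recursive(self):
        # Register two fake pages so no real HTTP requests are made
        responses.add(
            responses.GET,
            "https://example.com/",
            body='<a href="https://example.com/page">page</a>'
                 '<a href="https://example.com/style.css">css</a>',
            content_type="text/html",
        )
        responses.add(
            responses.GET,
            "https://example.com/page",
            body='<a href="https://example.com/">home</a>',
            content_type="text/html",
        )
        links = get_all_links_recursive("https://example.com/", "https://example.com/")
        # The crawler should follow the internal page but skip the stylesheet
        self.assertIn("https://example.com/page", links)
        self.assertNotIn("https://example.com/style.css", links)

if __name__ == "__main__":
    unittest.main()

Because responses intercepts the calls made by requests.get, this test runs quickly and deterministically, regardless of what any real site returns.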