So, more Screaming Frog-style scraping in Python. Let's get all URLs from a site, and when a URL contains non-ASCII characters, underscores, or uppercase characters, put it in a separate CSV file.
I guess I'm up to the challenge, so the assignment is:
Write a Python script that recursively checks a site and gets all URLs. Put all the URLs in a CSV file named "all_urls.csv", and put the URLs that have non-ASCII characters, underscores, or uppercase characters in a CSV file named "wrong_urls.csv". There is no need to scan URLs that fall outside the domain. Ask for a site domain with input() when the script starts.
And off we go!
To achieve this, you can use the 'requests' and 'BeautifulSoup' libraries to fetch and parse the HTML content of each page within the domain. You can then collect all the URLs and write them to the respective CSV files based on their characteristics.
Before you proceed, make sure you have installed the required libraries by running:
pip install requests beautifulsoup4
Here's the Python script for your requirement:
import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin
import re

def get_all_links_recursive(url, domain, visited=None):
    if visited is None:
        visited = set()
    if url in visited:
        return visited
    visited.add(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    for link in soup.find_all("a"):
        href = link.get("href")
        if href:
            abs_url = urljoin(domain, href)
            # Only follow links that stay within the domain
            if abs_url.startswith(domain) and abs_url not in visited:
                visited = get_all_links_recursive(abs_url, domain, visited)
    return visited

def is_wrong_url(url):
    # Flag URLs containing non-ASCII characters, underscores, or uppercase letters
    return bool(re.search(r"[^\x00-\x7F]|_|[A-Z]", url))

def main():
    domain = input("Enter the domain URL: ")
    all_links = get_all_links_recursive(domain, domain)
    all_urls = []
    wrong_urls = []
    for link in all_links:
        # Every URL goes into all_urls.csv; flagged ones also go into wrong_urls.csv
        all_urls.append(link)
        if is_wrong_url(link):
            wrong_urls.append(link)
    with open("all_urls.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["URL"])
        for url in all_urls:
            writer.writerow([url])
    with open("wrong_urls.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Wrong URL"])
        for url in wrong_urls:
            writer.writerow([url])
    print("CSV files created: all_urls.csv, wrong_urls.csv")

if __name__ == "__main__":
    main()
This script does the following:
Prompts the user for the domain URL.
Recursively fetches all the links within the domain using the get_all_links_recursive function.
Classifies the URLs as wrong if they contain non-ASCII characters, underscores, or uppercase characters.
Writes all URLs to 'all_urls.csv' and the flagged URLs to 'wrong_urls.csv'.
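The classification rule in step 3 is easy to sanity-check on its own. Here's a minimal sketch with a few made-up example URLs, assuming the same regex the script uses:

```python
import re

def is_wrong_url(url):
    # Flag any non-ASCII character, underscore, or uppercase letter
    return bool(re.search(r"[^\x00-\x7F]|_|[A-Z]", url))

print(is_wrong_url("https://example.com/contact"))   # clean: lowercase ASCII only -> False
print(is_wrong_url("https://example.com/my_page"))   # flagged: underscore -> True
print(is_wrong_url("https://example.com/Über-uns"))  # flagged: uppercase and non-ASCII -> True
```

Note that the check runs against the whole URL, so an uppercase letter in the scheme or domain would also trip it; depending on your needs you might want to apply it to the path only.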