Get all site URLs with Python
And put URLs with non-ASCII characters, underscores, or uppercase characters in a separate CSV file
So, more Screaming Frog-style scraping in Python. Let's get all URLs from a site and, when a URL contains non-ASCII characters, underscores, or uppercase characters, put it in a separate CSV file.
I guess I'm up to the challenge, so the assignment is:
Write a Python script that recursively crawls a site and gets all URLs. Put all the URLs in a CSV file named "all_urls.csv" and put the URLs that have non-ASCII characters, underscores, or uppercase characters in a CSV file named "wrong_urls.csv". There is no need to scan URLs that fall outside the domain. Ask for a site domain with input when the script starts.
And off we go!
To achieve this, you can use the 'requests' and 'BeautifulSoup' libraries to fetch and parse the HTML content of each page within the domain. You can then collect all the URLs and write them to the respective CSV files based on their characteristics.
Before you proceed, make sure you have installed the required libraries by running:
pip install requests beautifulsoup4
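To see the building blocks in isolation first: fetching a page and pulling out its links takes only a few lines (a minimal sketch, using example.com as a placeholder domain):

import requests
from bs4 import BeautifulSoup

page = requests.get("https://example.com")            # fetch the page
soup = BeautifulSoup(page.content, "html.parser")     # parse the HTML
links = [a.get("href") for a in soup.find_all("a")]   # collect every href on the page
print(links)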
Here's the Python script for your requirement:
import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin
import re

def get_all_links_recursive(url, domain, visited=None):
    """Recursively collect every URL on the site that starts with the domain."""
    if visited is None:
        visited = set()
    if url in visited:
        return visited
    visited.add(url)
    try:
        page = requests.get(url, timeout=10)
    except requests.RequestException:
        # Skip pages that fail to load and keep crawling the rest
        return visited
    soup = BeautifulSoup(page.content, "html.parser")
    for link in soup.find_all("a"):
        href = link.get("href")
        if href:
            # Resolve relative links against the page they appear on
            abs_url = urljoin(url, href)
            if abs_url.startswith(domain) and abs_url not in visited:
                visited = get_all_links_recursive(abs_url, domain, visited)
    return visited

def is_wrong_url(url):
    """A URL is 'wrong' if it contains non-ASCII characters, underscores, or uppercase letters."""
    return bool(re.search(r"[^\x00-\x7F]|_|[A-Z]", url))

def main():
    domain = input("Enter the domain URL: ")
    all_links = get_all_links_recursive(domain, domain)

    # Every URL goes into all_urls.csv; the offending ones also go into wrong_urls.csv
    all_urls = sorted(all_links)
    wrong_urls = [link for link in all_urls if is_wrong_url(link)]

    with open("all_urls.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["URL"])
        for url in all_urls:
            writer.writerow([url])

    with open("wrong_urls.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Wrong URL"])
        for url in wrong_urls:
            writer.writerow([url])

    print("CSV files created: all_urls.csv, wrong_urls.csv")

if __name__ == "__main__":
    main()
This script does the following:
Prompts the user for the domain URL.
Recursively fetches all the links within the domain using the get_all_links_recursive function.
Classifies a URL as wrong if it contains non-ASCII characters, underscores, or uppercase characters.
Writes every URL to 'all_urls.csv' and the wrong ones to 'wrong_urls.csv'.
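To make the classification rule concrete, here is what is_wrong_url returns for a few made-up URLs (hypothetical examples, not taken from any real site):

print(is_wrong_url("https://example.com/about-us"))   # False: lowercase, ASCII, no underscores
print(is_wrong_url("https://example.com/About_Us"))   # True: uppercase letters and an underscore
print(is_wrong_url("https://example.com/café"))       # True: non-ASCII character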
Please note that this script only follows regular <a href> links and resolves relative URLs against the page they appear on. It also does not handle JavaScript-rendered content, so URLs that only appear after JavaScript runs will be missed.
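If JavaScript-rendered links do matter for your site, one option is to swap the requests call for a headless browser. A minimal sketch, assuming Selenium and a local Chrome installation (neither is used in the script above):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def get_rendered_html(url):
    # Run Chrome headless so no browser window pops up
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source  # HTML after JavaScript has run
    finally:
        driver.quit()

The returned HTML can then be fed to BeautifulSoup exactly as before, in place of page.content in get_all_links_recursive. Keep in mind that launching a browser per page is much slower than plain requests.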