Photo by Shahadat Rahman on Unsplash
Check validity on big XML files with Python
And use an XSD when it is available
I wanted to check an XML file (250 MB in size), but I realised that a normal editor would not really be an option.
I thought of Python and was about to write a piece of code to check big XML files.
So the assignment is:
Write a Python script that can check big XML files for validity. I want to be able to enter the XML and XSD file paths as input. The script has to show the progress of the file check and show all problems at the end, both on screen and in a log file. If there is no XSD file, just check the XML file without a schema.
This script checks the validity of large XML files using the lxml library, which is more efficient for parsing large files than the built-in xml library. It shows the progress of the file check process and reports all problems at the end, both on screen and in a log file.
import os
import sys
import time

from lxml import etree
from tqdm import tqdm


def load_schema(xsd_file):
    """Compile the XSD, or return None when no XSD path was given."""
    if not xsd_file:
        return None
    try:
        return etree.XMLSchema(etree.parse(xsd_file))
    except (OSError, etree.XMLSyntaxError, etree.XMLSchemaParseError) as e:
        print(f"Error parsing XSD schema: {e}")
        sys.exit(1)


def validate_xml_file(xml_file, xsd_file=None):
    schema = load_schema(xsd_file)
    problems = []
    total_elements = 0
    with open(xml_file, "rb") as source, tqdm(
        total=os.path.getsize(xml_file), unit="B", unit_scale=True,
        unit_divisor=1024, ncols=100,
    ) as progress_bar:
        context = etree.iterparse(source, events=("end",), huge_tree=True)
        try:
            for _event, elem in context:
                total_elements += 1
                # Progress is the number of bytes the parser has read so far.
                progress_bar.update(source.tell() - progress_bar.n)
                if schema is None:
                    # Only well-formedness is checked, so finished elements
                    # can be dropped to keep memory use low.
                    elem.clear()
                    parent = elem.getparent()
                    while parent is not None and elem.getprevious() is not None:
                        del parent[0]
        except etree.XMLSyntaxError as e:
            # Not well-formed: the parser cannot continue past this point.
            problems.append({"line": e.lineno, "message": str(e)})
            return problems, total_elements
    # XSD validation needs the complete document, so with a schema the tree
    # is kept in memory and validated once parsing has finished.
    if schema is not None and not schema.validate(context.root):
        for entry in schema.error_log:
            problems.append({"line": entry.line, "message": entry.message})
    return problems, total_elements


if __name__ == "__main__":
    xml_file = input("Enter the path to the XML file: ").strip()
    xsd_file = input("Enter the path to the XSD file (leave empty to skip): ").strip()
    log_file = "xml_validation_log.txt"
    start_time = time.time()
    problems, total_elements = validate_xml_file(xml_file, xsd_file or None)
    end_time = time.time()
    with open(log_file, "w", encoding="utf-8") as log:
        log.write(f"Checked {total_elements} elements in {end_time - start_time:.2f} seconds.\n\n")
        if not problems:
            log.write("No problems found.\n")
            print("No problems found.")
        else:
            log.write("Problems found:\n")
            for problem in problems:
                log.write(f"Line {problem['line']}: {problem['message']}\n")
                print(f"Line {problem['line']}: {problem['message']}")
Before running the script, ensure you have the lxml and tqdm libraries installed. You can install them using pip:
pip install lxml tqdm
To use the script, run it and enter the path to the XML file you want to validate and the path to the XSD schema file when prompted. The script shows the progress of the file check, then displays and logs any problems found. The log file is named "xml_validation_log.txt".
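Before pointing the script at a 250 MB file, it can help to see what schema problems look like on a tiny document. The sketch below is separate from the script above and only illustrates the idea: the sample schema, the sample document and the helper name `collect_schema_errors` are made up for this example. It relies on nothing more than `etree.XMLSchema`, `validate()` and `error_log` from lxml.

```python
from lxml import etree

# Hypothetical sample schema: a <books> root holding <book> elements,
# each with a <title> and an integer <year>.
SAMPLE_XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="books">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              <xs:element name="year" type="xs:integer"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

# Hypothetical sample document: the second <year> is not an integer.
SAMPLE_XML = b"""<books>
  <book><title>First</title><year>2001</year></book>
  <book><title>Second</title><year>not-a-year</year></book>
</books>"""


def collect_schema_errors(xml_bytes, xsd_bytes):
    """Return (line, message) pairs for every schema violation found."""
    schema = etree.XMLSchema(etree.fromstring(xsd_bytes))
    document = etree.fromstring(xml_bytes)
    if schema.validate(document):
        return []
    # error_log holds one entry per problem reported by the validator.
    return [(entry.line, entry.message) for entry in schema.error_log]


if __name__ == "__main__":
    for line, message in collect_schema_errors(SAMPLE_XML, SAMPLE_XSD):
        print(f"Line {line}: {message}")
```

Running it prints one line per violation in the same "Line N: message" shape that the script writes to its log, which makes it easier to recognise real findings later on.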