Check the validity of big XML files with Python

And use an XSD when one is available

I wanted to check an XML file (250 MB in size), but I realised that a normal editor would not really be an option.

I was thinking about Python and was about to write a piece of code to actually check big XML files.

So the assignment is:

Write a Python script that can check big XML files for validity. I want to be able to supply the XML and XSD files as input. The script has to show the progress of the file check process and show all problems at the end on screen and in a log file. If there is no XSD file, just check the XML file without a schema.

This script checks the validity of large XML files using the lxml library, which is more efficient for parsing large files than the built-in xml library. It shows the progress of the file check process and logs all problems at the end on screen and in a log file.

import os
import sys
import time
from lxml import etree
from tqdm import tqdm

def validate_xml_file(xml_file, xsd_file=None):
    # Load the schema only when an XSD path was given; without one the
    # parser still checks that the XML is well-formed.
    schema = None
    if xsd_file:
        try:
            schema = etree.XMLSchema(etree.parse(xsd_file))
        except (etree.XMLSchemaParseError, etree.XMLSyntaxError, OSError) as e:
            print(f"Error loading XSD schema: {e}")
            sys.exit(1)

    problems = []
    total_elements = 0

    with open(xml_file, "rb") as source:
        with tqdm(total=os.path.getsize(xml_file), unit="B", unit_scale=True, unit_divisor=1024, ncols=100) as progress_bar:
            # Hand the schema to the parser so validation happens while streaming,
            # instead of calling schema.validate() on half-read elements.
            context = etree.iterparse(source, events=("end",), schema=schema, huge_tree=True)
            try:
                for event, elem in context:
                    total_elements += 1
                    # Progress is the number of bytes read from the file so far.
                    progress_bar.update(source.tell() - progress_bar.n)

                    # Free the finished element and its already-processed siblings.
                    elem.clear()
                    parent = elem.getparent()
                    while parent is not None and elem.getprevious() is not None:
                        del parent[0]
            except etree.XMLSyntaxError as e:
                # Raised for well-formedness and schema errors alike; parsing stops
                # here, and the error log holds whatever the parser recorded.
                for entry in e.error_log:
                    problems.append({"line": entry.line, "message": entry.message})
                if not problems:
                    problems.append({"line": e.lineno, "message": str(e)})

    return problems, total_elements

if __name__ == "__main__":
    xml_file = input("Enter the path to the XML file: ").strip()
    # Leave the XSD path empty to check well-formedness only.
    xsd_file = input("Enter the path to the XSD file (optional): ").strip() or None
    log_file = "xml_validation_log.txt"

    start_time = time.time()
    problems, total_elements = validate_xml_file(xml_file, xsd_file)
    elapsed = time.time() - start_time

    with open(log_file, "w", encoding="utf-8") as log:
        log.write(f"Checked {total_elements} elements in {elapsed:.2f} seconds.\n\n")
        if not problems:
            log.write("No problems found.\n")
            print("No problems found.")
        else:
            log.write("Problems found:\n")
            for problem in problems:
                line = f"Line {problem['line']}: {problem['message']}"
                log.write(line + "\n")
                print(line)
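
The assignment also asks for a check without a schema. As a rough illustration of that case, here is a minimal sketch that only tests whether a file is well-formed. It deliberately uses the standard library's xml.etree.ElementTree instead of lxml, so it is a swapped-in technique and not the script above. The function name check_well_formed and the report callback are made up for this example.

```python
import io
import xml.etree.ElementTree as ET


def check_well_formed(source, total_bytes, report=print):
    """Stream a binary XML file object and return (problems, element_count).

    `report` is a hypothetical callback that receives the percentage of
    bytes read so far; it is called once every 1000 elements.
    """
    problems = []
    count = 0
    try:
        for _event, elem in ET.iterparse(source, events=("end",)):
            count += 1
            elem.clear()  # drop the element's text and children to save memory
            if count % 1000 == 0 and total_bytes:
                report(100 * source.tell() / total_bytes)
    except ET.ParseError as e:
        # The parser stops at the first well-formedness error.
        line, _column = e.position
        problems.append({"line": line, "message": str(e)})
    return problems, count


good = b"<root><item>1</item><item>2</item></root>"
bad = b"<root><item>1</item><item>2</root>"

print(check_well_formed(io.BytesIO(good), len(good)))
print(check_well_formed(io.BytesIO(bad), len(bad)))
```

For a real file, open it with open(path, "rb") and pass os.path.getsize(path) as total_bytes. A well-formedness check like this says nothing about whether the content matches a schema.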

Before running the script, ensure you have the lxml and tqdm libraries installed. You can install them using pip:

pip install lxml tqdm

To use the script, you need to provide the path to the XML file you want to validate and the path to the XSD schema file. The script will output the progress of the file check process and will display and log any problems found. The log file will be named "xml_validation_log.txt".
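
To try the script without a 250 MB file at hand, it can help to generate a pair of small test files first. The sketch below writes a tiny schema and a document that breaks it. The file names sample.xml and sample.xsd, the items/item structure and the write_samples helper are all invented for this example and are not part of the script.

```python
from pathlib import Path

# Hypothetical sample data: a tiny schema and a document that violates it.
SAMPLE_XSD = """<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="items">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:integer" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<items>
  <item>1</item>
  <item>not a number</item>
</items>
"""


def write_samples(folder="."):
    """Write sample.xml and sample.xsd into `folder` and return their paths."""
    folder = Path(folder)
    xml_path = folder / "sample.xml"
    xsd_path = folder / "sample.xsd"
    xml_path.write_text(SAMPLE_XML, encoding="utf-8")
    xsd_path.write_text(SAMPLE_XSD, encoding="utf-8")
    return xml_path, xsd_path


if __name__ == "__main__":
    xml_path, xsd_path = write_samples()
    print(f"Wrote {xml_path} and {xsd_path}")
```

Entering sample.xml and sample.xsd at the prompts should then report at least one problem, because the second item is not an integer.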
