I have a very large XML file (almost 1 GB) that I need to split into 3 smaller files, all with the same headers. I would like to do this in Python.

Question

I'm opening the file with the code below, but it won't open because it is too big.

from xml.dom import minidom

# minidom.parse builds the whole document tree in memory
with open("C:\\Users\\samue\\OneDrive\\Desktop\\mopar.xml", "r", encoding="utf8") as Test_file:
    xmldoc = minidom.parse(Test_file)

def printNode(node):
    # Print this node, then recurse into its children
    print(node)
    for child in node.childNodes:
        printNode(child)

printNode(xmldoc.documentElement)

https://stackoverflow.com/a/326541/2834978 – LMC, May 12 '22 at 17:31

score 0 · Answer 1 · answered May 12 '22 at 17:42

I don't see an error message or stack trace in your post, but I suspect your code fails on the second or third line, where the file is opened and parsed.

Have you tried parsing your XML file with xml.etree.ElementTree? (The older xml.etree.cElementTree alias was removed in Python 3.9.)

For example, the code below shows how long ET takes to parse your XML file.

import time
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

xml_file_name = "mopar.xml"  # placeholder: replace with the path to your XML file

def read_xml_file(xml_file, element):
    """
    Parse the XML file with xml.etree.ElementTree and count the matching elements
    """
    tree = ET.parse(xml_file)  # loads the whole tree into memory
    root = tree.getroot()
    number_of_element = len(root.findall(element))
    return f'{number_of_element:,}'

start_time = time.perf_counter()
counter = read_xml_file(xml_file_name, 'ProteinEntry/header') # the element here depends on your XML header tag
end_time = time.perf_counter()
total_time = round(end_time - start_time, 2)
print(f'xml.etree.ElementTree - Total time taken:[{total_time}] seconds to identify the number of elements: [{counter}]')
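If parsing the whole tree at once still uses too much memory, a streaming approach with ET.iterparse may be worth trying for the split itself. The following is only a rough sketch under several assumptions that your file may not satisfy: the records are the direct children of the root element, "the same headers" means the root tag and its attributes, and the file does not use XML namespaces. The names count_records and split_xml and the output file names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def count_records(src):
    """First pass: count the direct children of the root without keeping them."""
    depth, count, root = 0, 0, None
    for event, elem in ET.iterparse(src, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem
            depth += 1
        else:
            depth -= 1
            if depth == 1:      # a direct child of the root just closed
                count += 1
                root.clear()    # discard it so memory use stays flat
    return count


def split_xml(src, out_paths):
    """Second pass: copy records into len(out_paths) files of similar size.

    Assumes no namespaces; each output repeats the root tag and attributes.
    """
    total = count_records(src)
    n = len(out_paths)
    depth, written, part = 0, 0, 0
    root, out = None, None
    open_tag = close_tag = ""
    for event, elem in ET.iterparse(src, events=("start", "end")):
        if event == "start":
            depth += 1
            if root is None:
                # Attributes are already available on the root's start event
                root = elem
                attrs = "".join(f" {k}={quoteattr(v)}" for k, v in elem.attrib.items())
                open_tag = f"<{elem.tag}{attrs}>\n"
                close_tag = f"</{elem.tag}>\n"
            continue
        depth -= 1
        if depth != 1:
            continue            # not a direct child of the root
        if out is None:
            out = open(out_paths[part], "w", encoding="utf-8")
            out.write('<?xml version="1.0" encoding="utf-8"?>\n' + open_tag)
        elem.tail = None        # the tail may not be parsed yet, so drop it
        out.write(ET.tostring(elem, encoding="unicode") + "\n")
        written += 1
        root.clear()
        # Move to the next file once this part has its share of the records
        if written >= total * (part + 1) // n:
            out.write(close_tag)
            out.close()
            out = None
            part += 1
    if out is not None:         # defensive: close a file left open
        out.write(close_tag)
        out.close()
```

It could be called like split_xml("mopar.xml", ["mopar_1.xml", "mopar_2.xml", "mopar_3.xml"]). The file is read twice (once to count, once to write) so that each output gets a contiguous block of records; if the order does not matter, a single pass that hands out records in rotation would also work. If there are fewer records than output files, some of the later files are not created.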
