
I'm trying to process a 4.6GB XML file with the following code:

import gc
import os
import xml.etree.ElementTree as ET

import pandas as pd
import psutil

# (file_name_data and the tuning constants used below are defined earlier in the script.)
context = ET.iterparse(file_name_data, events=("start", "end"))
in_pandcertificaat = False
pandcertificaat = {}
pandcertificaten = []
number_of_pickles = 0
for index, (event, elem) in enumerate(context):

    if event == "start" and elem.tag == "Pandcertificaat":
        in_pandcertificaat = True
        pandcertificaat = {}  # Initiate empty pandcertificaat.
        continue
    elif event == "end" and elem.tag == "Pandcertificaat":
        in_pandcertificaat = False
        pandcertificaten.append(pandcertificaat)
        continue
    elif in_pandcertificaat:
        pandcertificaat[elem.tag] = elem.text
    else:
        pass

    if index % iteration_interval_for_internal_memory_check == 0:
        print(f"index = {index:.2e}")
        process = psutil.Process(os.getpid())
        internal_memory_usage_in_mb = process.memory_info().rss / (1024 * 1024)
        print(f"Memory usage = {internal_memory_usage_in_mb:.2f} * MB.")

        if internal_memory_usage_in_mb > internal_memory_usage_limit_for_splitting_data_in_mb:
            df = pd.DataFrame(pandcertificaten)
            path_temporary_storage_data_frame = f"{base_path_temporary_storage_data_frame}{number_of_pickles}.{file_name_extension_pickle}"
            df.to_pickle(path_temporary_storage_data_frame)
            print(f"Intermediately saving data frame to {path_temporary_storage_data_frame} to save internal memory.")
            number_of_pickles += 1
            pandcertificaten.clear()
            gc.collect()

As you can see, I try to save RAM by periodically saving the Pandas data frames to files on disk, but for some reason the RAM usage still keeps increasing, even after adding gc.collect() in the hope of forcing garbage collection.

This is an example of the output I'm getting:

index = 3.70e+07
Memory usage = 2876.80 * MB.
Intermediately saving data frame to data_frame_pickles/26.pickle to save internal memory.

index = 3.80e+07
Memory usage = 2946.93 * MB.
Intermediately saving data frame to data_frame_pickles/27.pickle to save internal memory.

index = 3.90e+07
Memory usage = 3017.31 * MB.
Intermediately saving data frame to data_frame_pickles/28.pickle to save internal memory.

What am I doing wrong?

UPDATE 2023-03-17, 14:37.

The problem just got weirder. If I comment out everything in the for loop, the RAM usage still keeps increasing over time. I believe it follows that there is a problem with iterparse itself, and the out-of-RAM problem occurs with both lxml and xml.etree.ElementTree. I have not tried the XMLPullParser yet, as suggested by @Hermann12.
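
For reference, the extra step described in the answer linked in the comments below is to clear elements as soon as they have been processed, so the tree that iterparse builds in the background does not keep growing. A minimal sketch with xml.etree.ElementTree, assuming the Pandcertificaat elements sit directly under the document root and only their direct children are needed (file_name_data as above):

import xml.etree.ElementTree as ET

context = ET.iterparse(file_name_data, events=("start", "end"))
event, root = next(context)  # first event is the "start" of the root element

pandcertificaten = []
for event, elem in context:
    if event == "end" and elem.tag == "Pandcertificaat":
        # Copy the data out of the finished element...
        pandcertificaten.append({child.tag: child.text for child in elem})
        # ...then free it: clear the element itself and prune the
        # already-processed children from the root so the in-memory
        # tree stays small.
        elem.clear()
        root.clear()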

Adriaan
  • What output are you getting? – Axe319 Mar 16 '23 at 13:04
  • You can use the `tracemalloc` module to get more details about memory usage. https://docs.python.org/3/library/tracemalloc.html – Wim Coenen Mar 16 '23 at 13:08
  • Resident set size does not necessarily decrease when memory is freed. https://stackoverflow.com/questions/59817055/resident-set-size-remains-the-same-after-delete – Wim Coenen Mar 16 '23 at 13:10
  • @Axe319. I just added an example of the output I'm getting. – Adriaan Mar 16 '23 at 13:11
  • @WimCoenen. I already tried to understand the RAM usage with `sys.getsizeof(variable)` for all the variables. The sizes don't increase, yet the RAM usage does...which is very strange to me. – Adriaan Mar 16 '23 at 13:19
  • iterparse also needs extra steps to reduce memory usage: https://stackoverflow.com/questions/12160418/why-is-lxml-etree-iterparse-eating-up-all-my-memory – jqurious Mar 16 '23 at 13:26
  • As an extra note, you're moving `pandcertificaten` to a `DataFrame` and clearing `pandcertificaten`. However, `df` never goes out of scope and is never explicitly deleted until it is overwritten in the next memory check. This means the memory didn't actually get freed, you just moved it to another data structure. That being said, checking the OS for RAM usage might not be the best indicator about what the garbage collector is doing. See https://stackoverflow.com/questions/15455048/releasing-memory-in-python – Axe319 Mar 16 '23 at 13:51
  • BTW - do note that `iterparse` is a new argument recently added to [**`pandas.read_xml`**](https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html) in v1.5. See the [docs](https://pandas.pydata.org/docs/user_guide/io.html#io-read-xml), which describe reading in Wikipedia's very large (12 GB+) latest article data dump! (As author, I can tell you this XML was tested using `read_xml` on a laptop with 8 GB RAM for both the `lxml` and `etree` parsers.) (See the sketch after these comments.) – Parfait Mar 16 '23 at 15:32
  • Consider explaining the problem you are trying to solve, so we can perhaps suggest alternative technologies. – Michael Kay Mar 16 '23 at 15:58
  • For large files I recommend the non-blocking XMLPullParser. The results can be written into a sqlite3 database for later analysis. Pandas has some size limits. – Hermann12 Mar 16 '23 at 19:56
  • @Parfait. What a tremendous tip! It seems to work very well, but after a few minutes it still runs out of memory. Maybe because the data frame in which the data from the 4.6GB XML file is stored becomes too large for the RAM; I guess it's not intermediately saved to disk. – Adriaan Mar 17 '23 at 13:36
  • Please show a sample of your XML data and your attempt at `read_xml`. How many tags and attributes are you attempting to parse? Try reducing them to test. Are you using an IDE or calling Python at the command line? Are you running other apps? Usually the markup symbols make up a sizable portion of the XML size, but the parsed data will be much smaller. – Parfait Mar 18 '23 at 20:01
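
For reference, a minimal sketch of the `read_xml` route mentioned in the comments above (pandas ≥ 1.5 with its `iterparse` argument). The field names below are placeholders, since the question does not show which tags each Pandcertificaat contains:

import pandas as pd

# Hypothetical child tag names -- replace these with the element names
# that actually occur inside each <Pandcertificaat>.
wanted_fields = ["Pand_postcode", "Pand_huisnummer", "Pand_energieklasse"]

df = pd.read_xml(
    file_name_data,                                # path to the XML file on disk
    iterparse={"Pandcertificaat": wanted_fields},  # stream row by row instead of loading the whole tree
)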

0 Answers