I'm trying to process a 4.6GB XML file with the following code:
    import gc
    import os

    import pandas as pd
    import psutil
    import xml.etree.ElementTree as ET

    context = ET.iterparse(file_name_data, events=("start", "end"))

    in_pandcertificaat = False
    pandcertificaat = {}
    pandcertificaten = []
    number_of_pickles = 0

    for index, (event, elem) in enumerate(context):
        if event == "start" and elem.tag == "Pandcertificaat":
            in_pandcertificaat = True
            pandcertificaat = {}  # Initiate empty pandcertificaat.
            continue
        elif event == "end" and elem.tag == "Pandcertificaat":
            in_pandcertificaat = False
            pandcertificaten.append(pandcertificaat)
            continue
        elif in_pandcertificaat:
            pandcertificaat[elem.tag] = elem.text
        else:
            pass

        if index % iteration_interval_for_internal_memory_check == 0:
            print(f"index = {index:.2e}")
            process = psutil.Process(os.getpid())
            internal_memory_usage_in_mb = process.memory_info().rss / (1024 * 1024)
            print(f"Memory usage = {internal_memory_usage_in_mb:.2f} * MB.")

            if internal_memory_usage_in_mb > internal_memory_usage_limit_for_splitting_data_in_mb:
                df = pd.DataFrame(pandcertificaten)
                path_temporary_storage_data_frame = f"{base_path_temporary_storage_data_frame}{number_of_pickles}.{file_name_extension_pickle}"
                df.to_pickle(path_temporary_storage_data_frame)
                print(f"Intermediately saving data frame to {path_temporary_storage_data_frame} to save internal memory.")
                number_of_pickles += 1
                pandcertificaten.clear()
                gc.collect()
As you can see, I try to save RAM by intermittently writing the Pandas data frames to disk, but the RAM usage still keeps increasing, even after adding gc.collect() to hopefully force garbage collection.
This is an example of the output I'm getting:
    index = 3.70e+07
    Memory usage = 2876.80 * MB.
    Intermediately saving data frame to data_frame_pickles/26.pickle to save internal memory.
    index = 3.80e+07
    Memory usage = 2946.93 * MB.
    Intermediately saving data frame to data_frame_pickles/27.pickle to save internal memory.
    index = 3.90e+07
    Memory usage = 3017.31 * MB.
    Intermediately saving data frame to data_frame_pickles/28.pickle to save internal memory.
What am I doing wrong?
UPDATE 2023-03-17, 14:37.
The problem just got weirder. If I comment out everything inside the for loop, the RAM usage still keeps increasing over time. I believe it follows that there is a problem with iterparse itself. The out-of-RAM problem occurs with both lxml and xml.etree.ElementTree. I have not tried the XMLPullParser yet, as suggested by @Hermann12.
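For context, here is my understanding of the element-clearing pattern that the xml.etree.ElementTree documentation describes for iterparse: calling elem.clear() once an element's "end" event has been handled, so the tree that iterparse builds in the background stays small. A minimal sketch on a toy document (tag names match my data, but whether this resolves the growth on the real 4.6 GB file is exactly what I'm unsure about):

```python
import io
import xml.etree.ElementTree as ET

# Toy stand-in for the real file, just to show the pattern.
xml_data = b"""<root>
  <Pandcertificaat><adres>A</adres><label>B</label></Pandcertificaat>
  <Pandcertificaat><adres>C</adres><label>D</label></Pandcertificaat>
</root>"""

records = []
record = {}
for event, elem in ET.iterparse(io.BytesIO(xml_data), events=("end",)):
    if elem.tag == "Pandcertificaat":
        records.append(record)
        record = {}
    elif elem.tag != "root":
        record[elem.tag] = elem.text
    # Drop the element's text and children now that we are done with it,
    # so the partially built tree does not keep growing.
    elem.clear()

print(records)  # [{'adres': 'A', 'label': 'B'}, {'adres': 'C', 'label': 'D'}]
```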