The case:
I'm attempting to read an XML file, extract a small amount of data from it with BeautifulSoup, add that data to a dictionary, and then move on to the next file. Once I've extracted the data I need, the file should be closed and its memory released.
The problem:
The program eventually halts with a memory error, and Task Manager clearly shows memory consumption increasing after each file, which leads me to believe my files aren't being properly closed or released from memory. In my environment this happens after reading roughly 200 files.
Things I've tried without success:
Collecting garbage with gc.collect() (doesn't seem to make a difference)
Decomposing the soup with soup.decompose() (doesn't seem to make a difference)
Various files of different sizes
SoupStrainer (next to no difference with or without it; see the sketch after this list)
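For reference, the SoupStrainer attempt looked roughly like this (a minimal sketch; the tag list is an assumption based on the fields extracted in the code further down, lower-cased because the lxml parser lower-cases tag names):

<!-- language: python -->

from bs4 import BeautifulSoup, SoupStrainer

# Restrict parsing to the handful of tags the loop actually reads
# (assumed tag names, taken from the extraction code below).
only_needed = SoupStrainer(["quickref", "modifieddate", "ownerusername",
                            "companyname", "totalnetvalue"])

with open(fname, encoding="utf8") as open_file:
    soup = BeautifulSoup(open_file.read(), "lxml", parse_only=only_needed)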
2 "solutions" I found:
Forcing the script to restart itself after a while (not optimal)
Switching to 64-bit Python and adding more physical RAM (not optimal)
Info about the files:
- Ranging from 100 KB to 5 MB in size
- 10,000 to 70,000 lines per file
- Standard .xml formatting
Example of the XML structure / a snippet from one of the files (a file can contain up to 70,000 lines of this):
<!-- language: xml -->
<Partner>
    <Language>en-US</Language>
    <PartnerRole>stackoverflow1</PartnerRole>
    <IsSalesAreaDependent>True</IsSalesAreaDependent>
    <ContactPerson>
        <ContactPerson>
            <Language>en-US</Language>
        </ContactPerson>
    </ContactPerson>
    <InheritFromSoldTo>True</InheritFromSoldTo>
    <SalesAreaData>
        <SalesAreaData>
            <Language>en-US</Language>
            <Valid>False</Valid>
            <SalesOrganization>stackoverflow2</SalesOrganization>
            <DistributionChannel>stackoverflow3</DistributionChannel>
            <SalesDivision>stackoverflow4</SalesDivision>
            <CustomerGroup />
            <Currency>stackoverflow5</Currency>
            <PriceGroup />
            <PriceList>stackoverflow6</PriceList>
            <ShippingConditions />
            <Plant />
            <PaymentTerms />
        </SalesAreaData>
    </SalesAreaData>
    <CustomerHierarchy />
</Partner>
Code:
<!-- language: python -->

import gc
import glob
import os
from bs4 import BeautifulSoup

counter = 0
corruptCounter = 0
totalsize = 0
dictHolder = {}

for fname in glob.glob(path + "/Quotes/**/*.quote", recursive=True):  # Further define path
    with open(fname, encoding="utf8") as open_file:
        gc.collect()
        counter += 1
        contents = open_file.read()
        soup = BeautifulSoup(contents, 'lxml')
        try:
            results = ("(" + str(counter) + ") " + " Ref: " + soup.quickref.string +
                       " Last modified: " + soup.modifieddate.string)
            bsize = os.path.getsize(fname)
            totalsize += bsize
            tempdata = (soup.modifieddate.string, soup.quickref.string,
                        soup.ownerusername.string, soup.companyname.string,
                        soup.totalnetvalue.string, fname)
            dictHolder[counter] = tempdata
        except AttributeError:
            results = "(" + str(counter) + ") " + "Invalid data / corrupted file, please check: " + fname
            corruptCounter += 1
        soup.decompose()
        gc.collect()
        print(results)
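Possibly relevant: the Beautiful Soup documentation warns that a NavigableString (which is what soup.quickref.string and its siblings return) carries a reference to the entire parse tree, so storing them in dictHolder might itself be keeping each tree alive. A sketch of storing plain string copies instead (only the tempdata line changes):

<!-- language: python -->

# str() copies just the text and drops the NavigableString's
# back-reference to the full parse tree.
tempdata = (str(soup.modifieddate.string), str(soup.quickref.string),
            str(soup.ownerusername.string), str(soup.companyname.string),
            str(soup.totalnetvalue.string), fname)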
10/08/2020: The problem has been "solved" by switching to the xml.etree.ElementTree module. That doesn't really count as an answer or a solution, but if someone runs into the same problem in the future and reads this, try the module above; a sketch follows below.
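A minimal sketch of the equivalent extraction with xml.etree.ElementTree (the tag names here are assumptions mirroring the fields extracted above; ElementTree is case-sensitive, unlike the lower-casing lxml HTML parser, so adjust them to the real casing in your files):

<!-- language: python -->

import xml.etree.ElementTree as ET

def extract_quote_data(fname):
    # Parse the whole file into a plain Python tree; it is freed as soon
    # as the last reference to it goes out of scope.
    root = ET.parse(fname).getroot()

    # findtext returns the element's text, or None when the tag is
    # missing, which stands in for the AttributeError handling above.
    # Tag names/casing below are assumed, not taken from the real schema.
    return (root.findtext(".//ModifiedDate"),
            root.findtext(".//QuickRef"),
            root.findtext(".//OwnerUserName"),
            root.findtext(".//CompanyName"),
            root.findtext(".//TotalNetValue"),
            fname)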