
The case:

I'm attempting to read an XML file, extract a small amount of data from it with BeautifulSoup, add that data to a dictionary, close the file, and then move on to the next file. Once I've extracted the data I need, the file should be closed and its memory released.

The problem:

The program eventually halts with a memory error, and Task Manager clearly shows memory consumption increasing after each file, which leads me to believe my files aren't being properly closed or released from memory. In my environment this happens after reading roughly 200 files.

Things I've tried without success:

  • Collect garbage with gc.collect() (Doesn't seem to make a difference)

  • Decompose the file with soup.decompose() (Doesn't seem to make a difference)

  • Various files of different sizes

  • SoupStrainer (Next to no difference with/without it); see the sketch just below this list
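For context, the SoupStrainer attempt would look roughly like the sketch below: it restricts parsing to the handful of tags read in the code further down (the `lxml` parser lowercases tag names, hence the lowercase names). This is a sketch, not the exact code I ran.

from bs4 import BeautifulSoup, SoupStrainer

# Only build soup objects for the tags that are actually read later on.
only_needed = SoupStrainer(["quickref", "modifieddate", "ownerusername",
                            "companyname", "totalnetvalue"])
soup = BeautifulSoup(contents, "lxml", parse_only=only_needed)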

2 "solutions" I found:

  • Force the script to restart itself after a while (Not optimal)

  • 64-bit version and more physical ram (Not optimal)

Info about the files:

  • Ranging from 100 KB to 5 MB in size
  • 10,000 to 70,000 lines per file.
  • Standard .xml formatting

Example of XML structure / snippet from file. (Can be up to 70,000 lines of this):

<!-- language: xml -->
<Partner>
  <Language>en-US</Language>
  <PartnerRole>stackoverflow1</PartnerRole>
  <IsSalesAreaDependent>True</IsSalesAreaDependent>
  <ContactPerson>
    <ContactPerson>
      <Language>en-US</Language>
    </ContactPerson>
  </ContactPerson>
  <InheritFromSoldTo>True</InheritFromSoldTo>
  <SalesAreaData>
    <SalesAreaData>
      <Language>en-US</Language>
      <Valid>False</Valid>
      <SalesOrganization>stackoverflow2</SalesOrganization>
      <DistributionChannel>stackoverflow3</DistributionChannel>
      <SalesDivision>stackoverflow4</SalesDivision>
      <CustomerGroup />
      <Currency>stackoverflow5</Currency>
      <PriceGroup />
      <PriceList>stackoverflow6</PriceList>
      <ShippingConditions />
      <Plant />
      <PaymentTerms />
    </SalesAreaData>
  </SalesAreaData>
  <CustomerHierarchy />
</Partner>

Code:

import gc
import glob
import os
from bs4 import BeautifulSoup

counter = 0
corruptCounter = 0
totalsize = 0
dictHolder = {}

for fname in glob.glob(path + "/Quotes/**/*.quote", recursive=True):  # define 'path' further up

    with open(fname, encoding="utf8") as open_file:

        gc.collect()
        counter += 1
        contents = open_file.read()
        soup = BeautifulSoup(contents, 'lxml')

        try:
            results = ("(" + str(counter) + ") " + " Ref: " + soup.quickref.string
                       + " Last modified: " + soup.modifieddate.string)
            bsize = os.path.getsize(fname)
            totalsize += bsize

            # Keep only the handful of fields I need before moving on to the next file.
            tempdata = (soup.modifieddate.string, soup.quickref.string,
                        soup.ownerusername.string, soup.companyname.string,
                        soup.totalnetvalue.string, fname)
            dictHolder[counter] = tempdata

        except AttributeError:

            results = "(" + str(counter) + ") " + "Invalid data / corrupted file, please check: " + fname
            corruptCounter += 1

        soup.decompose()
        gc.collect()
        print(results)

10/08/2020: The problem has been "solved" by switching to the xml.etree.ElementTree module. That doesn't really count as an answer or a solution, but if someone runs into the same problem in the future and reads this, try the module above.
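For anyone making the same switch, a minimal sketch of what the ElementTree version can look like (the CamelCase tag names such as QuickRef and ModifiedDate are assumptions based on the fields read in the code above):

import glob
import xml.etree.ElementTree as ET

dictHolder = {}
counter = 0

for fname in glob.glob(path + "/Quotes/**/*.quote", recursive=True):
    counter += 1
    root = ET.parse(fname).getroot()

    # findtext returns None when a tag is missing, so a corrupted file
    # produces None values instead of raising an exception.
    dictHolder[counter] = (
        root.findtext(".//ModifiedDate"),
        root.findtext(".//QuickRef"),
        root.findtext(".//OwnerUsername"),
        root.findtext(".//CompanyName"),
        root.findtext(".//TotalNetValue"),
        fname,
    )
    # Nothing keeps a reference to the parsed tree, so it can be freed
    # before the next file is read.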

nordmanden
  • Isn't the `with open` block supposed to be within the for loop? – Sarwagya Aug 08 '20 at 11:39
  • Nicely spotted. It is in the actual code, but I can't seem to figure out how to indent it on stackoverflow. If someone is knowledgeable with StackOverflow formatting, please indent that part, thank you. – nordmanden Aug 08 '20 at 11:43
  • Not sure why you need BeautifulSoup; you can directly use the underlying engine, `lxml`, which can parse files without `open`. Also, [`.read` is memory-intensive](https://stackoverflow.com/a/8009942/1422451) and may not be released back within `for` loops. Please post a sample of the XML so we can see the tree. – Parfait Aug 10 '20 at 02:10
  • I've updated the OP with an example of the XML structure. It may be worth noting that not only does each loop consume more memory, it also gets progressively slower with each iteration (despite the actual files not increasing in size or complexity). I don't necessarily "need" BeautifulSoup, but I find it easy to work with given my relatively beginner-level Python skills. – nordmanden Aug 10 '20 at 08:35
  • Update 10/08: The problem has been "solved" by switching to the 'xml.etree.ElementTree' module, with otherwise the same or similar code. Switching to a different module doesn't really count as an answer, so I'm gonna leave this up in case someone finds this useful or figures it out. Total processing time went from ~2 hours to ~5 minutes. – nordmanden Aug 10 '20 at 14:26

2 Answers


try..

del contents

It could be that the garbage collector can't remove it because there is still a live reference to the object keeping its reference count above zero.
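In the loop above, that would mean something like this at the end of each iteration (a sketch; it drops both the raw string and the soup before the next file is read):

        soup.decompose()   # tear down the parse tree
        del soup           # drop the reference to the soup object
        del contents       # drop the reference to the raw file contents
        gc.collect()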

also try running this code....

import sys

sorted(
    [(x, round(sys.getsizeof(globals().get(x)) / 1000000, 2))
     for x in dir()
     if not x.startswith('_') and x not in sys.modules
     and x not in ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']],
    key=lambda x: x[1],
    reverse=True,
)

It shows the (shallow) size of each object in memory (in MB) and might give an indication of what's hogging the resources.

ChrisM
  • Thanks for the suggestion, unfortunately it didn't change anything. – nordmanden Aug 08 '20 at 10:59
  • strange, personally I would store in pandas - easier to debug. – ChrisM Aug 08 '20 at 11:08
  • Interestingly enough, the code you shared clearly shows `contents` being cleared each iteration, but the actual memory usage according to Task Manager still keeps increasing with each iteration. – nordmanden Aug 08 '20 at 11:27
  • `del contents` is probably unnecessary, as even though gc doesn't free the allocation, it gets overwritten every iteration (after which the old allocation is "freed"). – Sarwagya Aug 08 '20 at 11:29

I don't know much about BeautifulSoup, but reading thousands of CSV files with pandas and storing them in a dictionary has worked for me, simply reading each file and adding it to the dictionary. You can try reading the files with pandas and check whether the issue still appears around the 200th file; if it does, I'd assume it is a RAM issue.
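If you go that route, newer pandas versions (1.3+) can also read XML directly via pandas.read_xml, so the check could look roughly like the sketch below; the xpath here is only a guess based on the sample structure in the question, not your actual fields:

import glob
import pandas as pd

frames = {}
for i, fname in enumerate(glob.glob(path + "/Quotes/**/*.quote", recursive=True)):
    # read_xml flattens the elements matched by xpath into a DataFrame.
    frames[i] = pd.read_xml(fname, xpath=".//SalesAreaData/SalesAreaData")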

Granth
  • I appreciate the suggestion, switching to a different module is not preferable, however, if it's the only solution I might have to. – nordmanden Aug 08 '20 at 11:00