
I have a dilemma.

I need to read very large XML files from all kinds of sources, so the files are often invalid or malformed XML. I still must be able to read them and extract some information. Since I need the tag information, I need an XML parser.

Is it possible to use Beautiful Soup to read the data as a stream instead of reading the whole file into memory?

I tried ElementTree, but I cannot use it because it chokes on any malformed XML.

If Python is not the best language to use for this project, please add your recommendations.

alig227
  • lxml has a stream api, you can check the documentations here: https://lxml.de/3.6/parsing.html#incremental-event-parsing – spiralmoon Jul 01 '20 at 05:21

1 Answer


Beautiful Soup has no streaming API that I know of. There are, however, alternatives.

The classic approach to parsing large XML streams is an event-oriented parser, namely SAX; in Python, the xml.sax package. It never loads the whole document into memory, and a malformed section does not cost you everything: events delivered before the error have already reached your handler, so you can keep the information extracted from the well-formed portion of the file.
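
A minimal sketch of the SAX approach (the sample data and the `TagCollector` handler are illustrative, not from the question): events arrive as the stream is read, so anything delivered before a parse error is already collected.

```python
import io
import xml.sax

class TagCollector(xml.sax.ContentHandler):
    """Collects every start-tag name the parser reports."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)

# <oops> is never closed, so the trailing </root> is a mismatched tag
data = b"<root><item>ok</item><item>also ok</item><oops></root>"

handler = TagCollector()
try:
    xml.sax.parse(io.BytesIO(data), handler)
except xml.sax.SAXParseException:
    pass  # events delivered before the error are already in handler.tags

print(handler.tags)  # ['root', 'item', 'item', 'oops']
```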

SAX, however, is low-level and a bit rough around the edges; in Python it feels clunky and un-idiomatic.

The xml.etree.ElementTree implementation, on the other hand, has a much nicer interface, is pretty fast (since Python 3.3 it uses the C accelerator automatically, so the separate cElementTree module is deprecated), and can handle streaming through its iterparse() function.
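
A short sketch of the iterparse() pattern (the feed/entry structure is made up for illustration): elements are yielded as their end tags are seen, and clearing them keeps memory bounded no matter how large the input is.

```python
import io
import xml.etree.ElementTree as ET

# A stand-in for a large file: iterparse reads it incrementally
data = (b"<feed><entry><title>first</title></entry>"
        b"<entry><title>second</title></entry></feed>")

titles = []
for event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
    if elem.tag == "title":
        titles.append(elem.text)
    elem.clear()  # discard processed elements so memory stays bounded

print(titles)  # ['first', 'second']
```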

ElementTree is the better choice, if you can find a way to manage the errors.
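
One way to manage the errors with ElementTree is its XMLPullParser: you feed the bytes yourself, and events queued before a parse error are still delivered before the error is raised from read_events(). A sketch (the sample data is illustrative):

```python
import xml.etree.ElementTree as ET

# The stray '&' inside <bad ...> makes this document malformed
data = b"<root><item>good</item><item>also good</item><bad & data></root>"

parser = ET.XMLPullParser(events=("end",))
parser.feed(data)

found = []
try:
    for event, elem in parser.read_events():
        if elem.tag == "item":
            found.append(elem.text)
except ET.ParseError:
    pass  # keep whatever was extracted before the first error

print(found)  # ['good', 'also good']
```

For truly broken input that you want to parse past (not just up to) the first error, lxml's parsers with recover=True are worth a look, as the comment above suggests.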

salezica