2

I have very large XML files to process. I want to convert them to readable PDFs with colors, borders, images, tables and fonts. My machine has limited resources, so the application needs to be economical with both memory and CPU.

I did some research on which technology to use, but I could not decide which programming language and API best fit my requirements. I believe DOM is not an option because it consumes a lot of memory, but would Java with a SAX parser fulfill my requirements?

Some people also recommended Python for XML parsing. Is it that good?

I would appreciate your kind advice.

mowienay
  • Python has a very simple and powerful library called BeautifulSoup which is great for XML parsing. – karthikr Jun 10 '13 at 06:32
  • Thank you karthikr very much. Is beautifulsoup gentle on memory and fast? – mowienay Jun 10 '13 at 06:33
  • Please quantify "very large". Would an engineer ask for help building a bridge over a "very wide" river? Would anyone dare to offer suggestions without knowing how wide the river actually is? I've heard people refer to 1 MB as very large. The solution for 1 MB is quite different from the one for 1 GB. Generally I would be surprised if a document intended for human consumption is too big to fit in memory these days, unless there are a lot of images. – Michael Kay Jun 10 '13 at 09:23
  • Thank you, Michael! I want to handle around 200K XML files, each about 2 MB. I will consider your advice later on. – mowienay Jun 10 '13 at 12:00
  • Have you looked at vtd-xml (http://vtd-xml.sf.net)? – vtd-xml-author Jul 18 '13 at 19:31

4 Answers

2

Yes, I think SAX will work for you. DOM is not good for large XML files because it keeps the whole XML file in memory, whereas SAX streams through the document and hands you one event at a time. You can see a comparison I wrote on my blog here.
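As a rough illustration of the streaming style, here is a minimal sketch using the SAX parser that ships with the JDK. The class name `SaxElementCounter`, the method `countElements` and the sample XML are made up for this example; the point is only that the handler sees one event at a time, so memory use stays small no matter how large the file is.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of any library.
public class SaxElementCounter {

    /** Streams the document and counts how often each element name occurs. */
    public static Map<String, Integer> countElements(InputStream in) throws Exception {
        final Map<String, Integer> counts = new TreeMap<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

        // The handler is called once per start tag; no tree is ever built.
        parser.parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                counts.merge(qName, 1, Integer::sum);
            }
        });
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Sample input invented for the demo; a real run would pass a FileInputStream.
        String xml = "<report><row/><row/><row/></report>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(countElements(in));
    }
}
```

For the PDF conversion you would replace the counting logic with calls that emit output as each element arrives, keeping only the state you need for the current table or paragraph.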

Sanjaya Liyanage
2

SAX is a very good parser, but it is outdated.

A newer alternative is StAX (the Streaming API for XML), a pull parser that has shipped with the JDK since Java SE 6 and parses XML files efficiently:

http://docs.oracle.com/cd/E17802_01/webservices/webservices/docs/1.6/tutorial/doc/SJSXP2.html

The linked page also compares the parsers, including their memory utilization and features.
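To give an idea of how pull parsing with StAX looks, here is a small sketch based on the `javax.xml.stream` API in the JDK. The class name `StaxTitleReader`, the `title` element and the sample document are assumptions made for the example, not something from your files. Unlike SAX, your code asks for the next event when it is ready, which often makes the control flow easier to follow.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class, not part of any library.
public class StaxTitleReader {

    /** Pulls events one by one and collects the text of every <title> element. */
    public static List<String> readTitles(Reader source) throws Exception {
        List<String> titles = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(source);
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // getElementText() reads the text and moves to the matching end tag.
                    titles.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        // Sample input invented for the demo; a real run would pass a FileReader.
        String xml = "<books><book><title>A</title></book>"
                   + "<book><title>B</title></book></books>";
        System.out.println(readTitles(new StringReader(xml)));
    }
}
```

As with SAX, only the current event is held in memory, so this approach should also suit a machine with limited resources.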

Thanks, Pavan

Pavan
1

Not sure if you're interested in using Perl, but if you're open to it, the following are all good options: XML::LibXML, XML::LibXSLT and XML::Twig, which is good for files too large to fit in memory (so is XML::LibXML::Reader). Of course, SAX is available there as well, but it can be slow. Most people recommend the first two options. Finally, CPAN is an amazing source with a very active community.

Steve P.
1

If you want the best of DOM without its memory overhead, vtd-xml is the best bet. Here is a paper that supports this:

http://recipp.ipp.pt/bitstream/10400.22/1847/1/ART_BrunoOliveira_2013.pdf

vtd-xml-author