I need to parse XML files that are 40GB in size, then normalize the data and insert it into a MySQL database. It is not clear how much of the file I need to store in the database, nor do I know the XML structure.

Which parser should I use, and how would you go about doing this?

hakre
clean_coding
  • For large files, always use a pull parser like XMLReader; just try loading a 40GB file into memory for SimpleXML. – Mark Baker Mar 12 '13 at 14:25
  • From the related posts: http://stackoverflow.com/questions/911663/parsing-huge-xml-files-in-php?rq=1 – Nick Andriopoulos Mar 12 '13 at 14:26
  • 40GB? Wow, I had trouble with a 700MB XML file on a Mac OS X machine. – fizzy drink Mar 12 '13 at 14:26
  • XMLReader will read in chunks, so that's the one you want to use. SimpleXML places the whole thing in memory, which is not ideal for large files. – zgr024 Mar 12 '13 at 14:27
  • I would suggest you write your own XML parser with minimal features (only the features you want). This way you have the most lightweight XML parser possible, and of course you can keep optimizing regularly. – Girish Mar 12 '13 at 14:27
  • Check here to create your own: http://stackoverflow.com/questions/7707745/reading-a-large-file-line-by-line-or-storing-its-lines-in-an-array – zgr024 Mar 12 '13 at 14:29
  • If the format is predictable, I would probably do a very lightweight custom reader, cleaning up after data is inserted into MySQL before moving on to the next set. – datasage Mar 12 '13 at 14:31
  • @Girish - that's a crazy suggestion. Some of the best programmers on the planet have written highly optimized XML parsers; the chance of the average SO poster being able to do better is very small. – Michael Kay Mar 12 '13 at 18:34
  • @MichaelKay - I agree, but when the size is as big as 40GB, it's probably a very high-priority, resource-intensive task, so it might not be a bad suggestion. – Girish Mar 12 '13 at 18:37
  • "nor do I know the XML structure" — good luck with that! – salathe Mar 12 '13 at 18:45
  • @Girish, having a large data file and a high-priority task does not automatically mean that you have access to the world's best programmers. – Michael Kay Mar 14 '13 at 15:56
  • http://amolnpujari.wordpress.com/2012/03/31/reading_huge_xml-rb/ – it's so simple to deal with large XML in Ruby. – Amol Pujari Feb 19 '14 at 21:07

2 Answers


In PHP, you can read extremely large XML files with XMLReader (Docs):

$reader = new XMLReader();
$reader->open($xmlfile);

Extremely large XML files should be stored compressed on disk; that makes sense, as XML has a high compression ratio. For example, gzipped as large.xml.gz.

PHP supports that quite well with XMLReader via the compression wrappers (Docs):

$xmlfile = 'compress.zlib://path/to/large.xml.gz';

$reader = new XMLReader();
$reader->open($xmlfile);

XMLReader only lets you operate on the current element. That means it's forward-only: if you need to keep parser state, you have to build and maintain it yourself.

I often find it helpful to wrap the basic movements into a set of iterators that know how to operate on an XMLReader, e.g. iterating through elements or child elements only. You'll find this outlined in Parse XML with PHP and XMLReader.
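
For example, here is a rough sketch combining that with the database step from the question. Every specific in it is an assumption: the repeated <item> element, the table and column names, and the connection details are placeholders for whatever your file and schema actually contain.

$reader = new XMLReader();
$reader->open('compress.zlib://path/to/large.xml.gz');

// Hypothetical target schema; adjust to your data.
$pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO items (name, value) VALUES (?, ?)');

// Skip ahead to the first <item> element.
while ($reader->read() && $reader->name !== 'item');

while ($reader->name === 'item') {
    // expand() converts only the current element into DOM; importing it
    // into a fresh document lets SimpleXML query it conveniently.
    $doc  = new DOMDocument();
    $item = simplexml_import_dom($doc->importNode($reader->expand(), true));

    $stmt->execute(array((string) $item->name, (string) $item->value));

    // Jump to the next sibling <item> without re-reading its subtree.
    $reader->next('item');
}

$reader->close();

Memory use stays flat because only one <item> is expanded at any time; for insert speed, batch a few thousand rows per transaction rather than committing each row individually.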

hakre

It would be nice to know what you actually intend to do with the XML. The way you parse it depends very much on the processing you need to carry out, as well as the size.

If this is a one-off task, then in the past I've started by discovering the XML structure before doing anything else. My DTDGenerator (see saxon.sf.net) was written for this purpose a long time ago and still does the job; there are other tools available now, but I don't know whether they do streamed processing, which is a prerequisite here.

You can write an application that processes the data using either a pull or push streamed parser (SAX or StAX). How easy this is depends on how much processing you have to do and how much state you have to maintain, which you haven't told us. Alternatively you could try streamed XSLT processing, which is available in Saxon-EE.
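
As an illustration of that discovery step, here is a rough sketch of a hand-rolled pass (not DTDGenerator, and using PHP's XMLReader since the question appears to be PHP-based; the file path is a placeholder). It streams the file once and counts how often each element path occurs:

$reader = new XMLReader();
$reader->open('path/to/large.xml');

$stack = array();  // current element path
$paths = array();  // path => occurrence count

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        $stack[] = $reader->name;
        $key = implode('/', $stack);
        $paths[$key] = isset($paths[$key]) ? $paths[$key] + 1 : 1;
        if ($reader->isEmptyElement) {
            array_pop($stack);  // a self-closing <foo/> emits no END_ELEMENT
        }
    } elseif ($reader->nodeType === XMLReader::END_ELEMENT) {
        array_pop($stack);
    }
}

$reader->close();
arsort($paths);
print_r($paths);  // the highest-count paths are the candidate record elements

A path that occurs millions of times under a stable parent is almost certainly the record element you will want to normalize into rows.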

Michael Kay