
Update regarding the solution, in case someone has a similar problem (the original question is below):

For fast regex matching I found re2c (http://re2c.org/); for XML parsing, Expat (http://expat.sourceforge.net/).


Is there an XML library I can use to parse XML from memory (and not from a file) in a streaming manner in C?

Currently I have:

  • libxml2: XMLReader seems to work only with a file handle, not with in-memory data
  • rapidxml is C++ and does not seem to expose a C interface

Requirements:

  • I need to process the individual XML nodes without having the whole XML (400GB uncompressed, and "only" 29GB as the original .bz2 file) in memory (the bzip'd file gets read in and decompressed piecewise, and I would pass those uncompressed pieces to the XML parser)
  • It does not need to be very fast, but I would prefer an efficient solution
  • I (most probably) don't need the path of an extracted node, so it would be fine to just discard nodes as soon as they have been processed by my callback (if I turn out to need the path after all, I can still track it myself)

This is part of me trying to solve my own problem posted here (and no, it's not the same question): How to efficiently parse large bz2 xml file in C

Ideally I'd like to be able to feed the library a certain amount of bytes at a time and have a function called whenever a node is completed.

Thank you very much


Here's some pseudo C code (much shorter than the actual C code) for a better understanding:

while( bytes_processed_total < filesize ) {

  // extracted data gets put here; bzlib advances next_out, so reset it on every pass
  strm.next_out = buffer_ptr;
  strm.avail_out = buffer_size; // capacity of buffer_ptr (illustrative name)

  // consumes up to strm.avail_in input bytes, writes up to strm.avail_out output bytes
  BZ2_bzDecompress( &strm );

  bytes_processed = strm.next_out - buffer_ptr;
  bytes_processed_total += bytes_processed;

  // here I would like to pass bytes_processed bytes of buffer_ptr to the XML reader

}

About the data I want to parse: http://wiki.openstreetmap.org/wiki/OSM_XML

At the moment I only need certain <node ...> nodes from this, namely those that have a subnode <tag k="place" v="country|county|city|town|village"> (the '|' means one of those values in this context; in the file it's of course only "country" etc. without the '|')

griffin
  • I'm a little confused about one thing… first you say that you want to parse XML that is stored in memory, not in a file, but then: "I need to process the […] nodes without having the whole xml […] in memory". Is it in memory or not? – sidyll Aug 29 '13 at 13:32
  • You're right, it's confusing. I'll update the question right now. Thanks for pointing that out. – griffin Aug 29 '13 at 13:33
  • Updated, hope it's better now. – griffin Aug 29 '13 at 13:44

1 Answer

xmlReaderForMemory from libxml2 seems like a good fit to me (but I haven't used it, so I may be wrong).

The char * buffer needs to point to a valid XML document (which can be a part of your entire XML file). You can get one by reading your file in chunks and cutting out a valid XML fragment.

What's the structure of your XML file? A root containing a sequence of similar nodes, or a fully fledged tree?

If I had an XML like this:

<root>
<node>...</node>
<node>...</node>
<node>...</node>
</root>

I'd read starting from the opening <node> till the closing </node> and then parse it with the xmlReaderForMemory function, do what I need to do, then go on with the next <node> node.
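A rough sketch of the "cut out one fragment" step, using only the C standard library. It assumes the simplified layout from the example above (a literal `<node>` opening tag with no attributes and no self-closing form), which real data may not follow. `next_fragment` is a hypothetical helper name, not part of libxml2.

```c
#include <stddef.h>
#include <string.h>

/* Looks for the first complete <node>...</node> fragment in a
   NUL-terminated buffer. On success returns 1 and stores the fragment's
   start and length. Returns 0 when no complete fragment is present yet,
   in which case the caller should append more decompressed data and retry.
   Assumption: the opening tag is exactly "<node>" as in the example. */
static int next_fragment(const char *buf, const char **frag, size_t *frag_len)
{
    static const char open_tag[] = "<node>";
    static const char close_tag[] = "</node>";

    const char *start = strstr(buf, open_tag);
    if (start == NULL)
        return 0;

    const char *end = strstr(start, close_tag);
    if (end == NULL)
        return 0;

    *frag = start;
    *frag_len = (size_t)(end - start) + strlen(close_tag);
    return 1;
}
```

Each fragment found this way could then be handed to xmlReaderForMemory. Whatever follows the last complete fragment has to be kept and prepended to the next chunk, because a chunk boundary can fall in the middle of a node.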

Of course, if your <node> content is too complex or too long, you may have to go a few levels deeper:

<node>
<subnode>....</subnode>
<subnode>....</subnode>
<subnode>....</subnode>
<subnode>....</subnode>
</node>

And read from the file until you have the entire <subnode> node (while keeping track of the fact that you're inside a <node>).

I know it's ugly, but it is a viable way. Or you can try to use a SAX parser (I don't know whether a C implementation exists).

SAX parsing fires events on each node start and node end, so you can do nothing until you find your nodes and then process just those.

Another viable way can be using some external tools to filter the whole XML (XQuery or XPath processors) in order to extract just your interesting nodes from the whole file, obtain a smaller doc and then work on it.

EDIT: Zorba was a good XQuery framework with a command-line preprocessor; it may be a good place to look.

EDIT2: Well, since you have data of this size, one alternative solution is to treat the file as a text file: read and decompress it in chunks, and then match something like:

<yourNode>.*</yourNode>

with regexp.

If you're on Linux/Unix you should have the POSIX regexp library. Check this question on S.O. for further insights: http://stackoverflow.com/questions/1085083/regular-expressions-in-c-examples
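A small sketch of the POSIX approach with a pattern that is compiled once and reused. Instead of `<yourNode>.*</yourNode>` it matches only the `<tag>` element of interest, because POSIX regexes return the longest match and `.*` would run to the last closing tag in the buffer. The pattern assumes the attributes appear exactly as `k="place" v="..."` with double quotes and single spaces, which is an assumption about the file, not something the format guarantees. `place_regex_init` and `place_regex_find` are hypothetical helper names.

```c
#include <regex.h>
#include <stddef.h>

/* Compiles the pattern once so it can be reused for every chunk.
   Returns 0 on success, a nonzero regcomp error code otherwise. */
static int place_regex_init(regex_t *re)
{
    return regcomp(re,
        "<tag k=\"place\" v=\"(country|county|city|town|village)\"",
        REG_EXTENDED);
}

/* Searches a NUL-terminated text for the first matching <tag>.
   Returns 1 and stores the match offsets on success, 0 if nothing matches. */
static int place_regex_find(const regex_t *re, const char *text,
                            size_t *start, size_t *end)
{
    regmatch_t m[1];
    if (regexec(re, text, 1, m, 0) != 0)
        return 0;
    *start = (size_t)m[0].rm_so;
    *end = (size_t)m[0].rm_eo;
    return 1;
}
```

Call `regfree` on the compiled pattern when processing is finished. Since `regexec` needs a NUL-terminated string, each decompressed chunk has to be terminated before it is searched, and a match that straddles a chunk boundary is only found if the tail of the previous chunk is carried over.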

BigMike
  • I saw that function, but I didn't think of building documents from parts of my data. Although very inefficient with 400GB of data, at least it sounds like a doable way, so thank you for now. Regarding structure: http://wiki.openstreetmap.org/wiki/OSM_XML except for maybe counting nodes I wouldn't really know how to split this in a good way, and even then a node with subnodes (e.g. a relation with members) could be huge, so I hope there is a better solution to this, but I'll keep yours in mind if there is no other option. – griffin Aug 29 '13 at 13:49
  • Also to add (don't really know if I should put this in my question): I only need certain `<node>` nodes from the data right now, I don't really need to parse everything. To be exact, I only need those with a subnode of `<tag k="place">` (and v=city|town|county|village) – griffin Aug 29 '13 at 13:52
  • if you need just some parts you can preprocess your file via XQuery or XPath and then work on just the selected nodes – BigMike Aug 29 '13 at 13:56
  • The file is a 29GB .bz2 file, and I don't want to first uncompress it to 400GB every time, as the data changes once a week, and I already have the code to decompress in a loop. If I understand correctly, preprocessing would mean that I would not only parse the file multiple times (preprocessing is parsing it once already), but also have to have it lying around uncompressed. Also, reading from a tag start to the end of it would mean that I have to parse that part myself, at which point I would be building a parser for passing it to a parser afterwards - which seems beside the point to me? – griffin Aug 29 '13 at 14:00
  • I really like to work directly on the .bz2 file, as uncompressing it beforehand makes the whole process a lot slower and more I/O intensive without any real advantage (I don't need random seeks, and the file is always only parsed once whenever there is an update, which is once a week currently) – griffin Aug 29 '13 at 14:02
  • Reading and parsing are slightly different things :D. Since you're looking for just a few nodes, why don't you treat it like a normal text file, read it in chunks and strcmp until you identify your nodes? Okay, it is formally XML, but in the end it is just text – BigMike Aug 29 '13 at 14:04
  • I know that, but identifying what a tag is means I will have to match it, which is way more work than just strcmp in my case, as I'm not uncompressing linewise (you don't know where the line end is before decompressing it), and as there could theoretically be something like `attribute=""` in there (it's not against XML spec if I remember right?), I would need to parse the decompressed data to make sure I get the right part. That's why I said parsing instead of reading there. – griffin Aug 29 '13 at 14:08
  • Of course it is a bit more than strcmp. However, matching ".*" (regexp-like) in a char * buffer and then working only on the matched node can be a good solution – BigMike Aug 29 '13 at 14:12
  • Hm looking at the example file that actually sounds like a great idea - do you know of any good regex lib I could use for that? One where I can precompile the pattern would be good, as the pattern would not change over the whole 400GB of data, and regex is normally way slower than a real parser (if you have, update your answer with that, so I can accept it!) – griffin Aug 29 '13 at 14:15
  • If you're on a Linux/Unix machine you should have the POSIX regexp library; check here: http://stackoverflow.com/questions/1085083/regular-expressions-in-c-examples – BigMike Aug 29 '13 at 14:18
  • Thx! Looks like a solution! Can you put that stuff into your answer so I can accept it? Thx! – griffin Aug 29 '13 at 14:23