4

I have two large XML files (3GB, 80000 records). One is an updated version of the other. I want to identify which records changed (were added, updated, or deleted). The files contain some timestamps, but I am not sure they can be trusted. The same goes for the order of records within the files.

The files are too large to load into memory as XML (even one, never mind both).

My idea is to index the first file at record level, keeping an in-memory map from record ID to content offset, then stream the second file and use random access to compare the records that exist in both. This would probably take two or three passes, but that's fine. However, I cannot find a library or approach that makes this easy. vtd-xml with VTDNavHuge looks interesting, but I cannot tell from the documentation whether it supports random-access revisiting and loading of records based on previously saved locations.
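A minimal sketch of one variation on this idea, using only the JDK's StAX API. Note that it swaps the offset index for a per-record fingerprint: each file is streamed once, and only a map from ID to a SHA-256 digest of the record's content is kept in memory, so no random access is needed. The element name `record` and the attribute name `id` are hypothetical placeholders for whatever the real schema uses, and the sketch assumes attribute order and whitespace-only text are insignificant.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class RecordDiff {
    // Hypothetical names: adjust to the real schema.
    static final String RECORD = "record";
    static final String ID_ATTR = "id";

    /** Streams one file and returns a map of record ID to content digest. */
    static Map<String, String> fingerprints(InputStream in)
            throws XMLStreamException, NoSuchAlgorithmException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Merge adjacent text chunks so chunking cannot affect the digest.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader r = factory.createXMLStreamReader(in);
        Map<String, String> result = new HashMap<>();
        MessageDigest md = null;
        String id = null;
        int depth = 0; // nesting depth inside the current record; 0 means outside
        while (r.hasNext()) {
            int ev = r.next();
            if (ev == XMLStreamConstants.START_ELEMENT) {
                if (depth == 0 && RECORD.equals(r.getLocalName())) {
                    id = r.getAttributeValue(null, ID_ATTR);
                    md = MessageDigest.getInstance("SHA-256");
                    depth = 1;
                } else if (depth > 0) {
                    depth++;
                }
                if (depth > 0) {
                    // Sort attributes so their order in the file does not matter.
                    Map<String, String> attrs = new TreeMap<>();
                    for (int i = 0; i < r.getAttributeCount(); i++) {
                        attrs.put(r.getAttributeLocalName(i), r.getAttributeValue(i));
                    }
                    feed(md, "<" + r.getLocalName() + attrs);
                }
            } else if (ev == XMLStreamConstants.END_ELEMENT && depth > 0) {
                feed(md, "</" + r.getLocalName());
                if (--depth == 0 && id != null) {
                    result.put(id, Base64.getEncoder().encodeToString(md.digest()));
                }
            } else if (ev == XMLStreamConstants.CHARACTERS && depth > 0 && !r.isWhiteSpace()) {
                feed(md, r.getText());
            }
        }
        r.close();
        return result;
    }

    private static void feed(MessageDigest md, String token) {
        md.update(token.getBytes(StandardCharsets.UTF_8));
        md.update((byte) 0); // separator so adjacent tokens cannot run together
    }

    /** Compares two fingerprint maps; returns ID to "added", "updated" or "deleted". */
    static Map<String, String> diff(Map<String, String> oldFp, Map<String, String> newFp) {
        Map<String, String> changes = new TreeMap<>();
        for (Map.Entry<String, String> e : newFp.entrySet()) {
            String before = oldFp.get(e.getKey());
            if (before == null) {
                changes.put(e.getKey(), "added");
            } else if (!before.equals(e.getValue())) {
                changes.put(e.getKey(), "updated");
            }
        }
        for (String key : oldFp.keySet()) {
            if (!newFp.containsKey(key)) {
                changes.put(key, "deleted");
            }
        }
        return changes;
    }
}
```

Calling `fingerprints` on a `FileInputStream` for each file and passing both maps to `diff` takes two passes in total and ignores record order. With 80000 records the two maps are small. The trade-off is that this reports which records changed but not what changed inside them; that would need a further pass over the flagged IDs.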

A Java library/solution is preferred, but C# is acceptable too.

Alexandre Rafalovitch
  • 1
    Extended VTD-XML supports memory mapping, which means the document may not be entirely loaded into memory. It supports random access just like the standard vtd-xml. Since you say you can't load the docs into memory, I think extended VTD-XML might be the right choice, and it should be easier to use and faster than SAX. – vtd-xml-author Apr 11 '13 at 05:05
  • When you talk about _"The files are too large to load into memory as XML"_ what data structures did you use? – classicjonesynz Apr 11 '13 at 07:32
  • If you can't find anything to analyze the files with in code, there are other solutions such as the [notepad++ compare](http://sourceforge.net/projects/npp-compare/) plugin or the open-source project [winmerge](http://sourceforge.net/projects/winmerge/?source=dlp) – classicjonesynz Apr 11 '13 at 07:43
  • @vtd-xml-author I did look at Extended VTD, but I can't figure out how to revisit a record. There seem to be methods to get the position of an element as long[] or as an index, but no methods to come back to that position. Is there an example for that? – Alexandre Rafalovitch Apr 11 '13 at 13:58
  • @KIllrawr - Any in-memory XML structure was just much bigger than the original file. The only way I was able to deal with it to date was to use streaming mode and throw away irrelevant information. That worked for one file. But with two files, I need a different algorithm again. – Alexandre Rafalovitch Apr 11 '13 at 13:59
  • 1
    VTD records can be accessed from the VTDNavHuge object. They are essentially a big array that you can address by specifying an index value. Every record has an offset, a length, a type and a depth. Let me know if you need more info, given the limited space here... – vtd-xml-author Apr 11 '13 at 21:01
  • have a look at this http://andreas.haufler.info/2012/01/conveniently-processing-large-xml-files.html – constantlearner May 13 '13 at 20:15
  • @constantlearner - thanks, but that's for a single document. I already know how to deal with that. I use XOM for that, [code is available](https://github.com/arafalov/Lotus-Notes-Exporter/tree/master/src/alex). My problem was comparing files, the next level issue. – Alexandre Rafalovitch May 14 '13 at 15:23

1 Answer

1

Just parse both documents simultaneously using SAX or StAX until you encounter a difference, then exit. Neither API keeps the document in memory, and any standard XML library will support S(t)AX. The only problem arises if you consider the order of elements to be insignificant, because a lockstep comparison reports reordered records as differences.
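A rough sketch of that lockstep comparison with the JDK's StAX API follows. The class and method names are made up for illustration, and it assumes that comments, processing instructions, whitespace-only text and attribute order do not count as differences.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.InputStream;
import java.util.Map;
import java.util.TreeMap;

public class LockstepCompare {

    /** Returns true if both streams yield the same elements, attributes and text, in the same order. */
    static boolean sameContent(InputStream a, InputStream b) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Merge adjacent text chunks so both readers report text the same way.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader ra = factory.createXMLStreamReader(a);
        XMLStreamReader rb = factory.createXMLStreamReader(b);
        try {
            while (true) {
                int ea = nextSignificant(ra);
                int eb = nextSignificant(rb);
                if (ea != eb) {
                    return false; // different kind of event: stop at the first difference
                }
                if (ea == XMLStreamConstants.END_DOCUMENT) {
                    return true; // both documents ended together
                }
                if (ea == XMLStreamConstants.START_ELEMENT) {
                    if (!ra.getName().equals(rb.getName())
                            || !attributes(ra).equals(attributes(rb))) {
                        return false;
                    }
                } else if (ea == XMLStreamConstants.CHARACTERS) {
                    if (!ra.getText().equals(rb.getText())) {
                        return false;
                    }
                }
                // END_ELEMENT needs no check: matching start tags imply matching end tags.
            }
        } finally {
            ra.close();
            rb.close();
        }
    }

    /** Advances to the next start tag, end tag, non-blank text, or end of document. */
    private static int nextSignificant(XMLStreamReader r) throws XMLStreamException {
        while (r.hasNext()) {
            int ev = r.next();
            if (ev == XMLStreamConstants.START_ELEMENT
                    || ev == XMLStreamConstants.END_ELEMENT
                    || ev == XMLStreamConstants.END_DOCUMENT) {
                return ev;
            }
            if (ev == XMLStreamConstants.CHARACTERS && !r.isWhiteSpace()) {
                return ev;
            }
        }
        return XMLStreamConstants.END_DOCUMENT;
    }

    /** Collects the current element's attributes into a sorted map. */
    private static Map<String, String> attributes(XMLStreamReader r) {
        Map<String, String> attrs = new TreeMap<>();
        for (int i = 0; i < r.getAttributeCount(); i++) {
            attrs.put(r.getAttributeName(i).toString(), r.getAttributeValue(i));
        }
        return attrs;
    }
}
```

This only answers whether the two files differ, and it stops at the first mismatch. It does not say which records were added, updated or deleted.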