
My task is to load a new set of data (written in an XML file) and then compare it to the 'old' set (also in XML). All the changes are written to another file.

My program loads the new and old files into two DataSets, then row by row I compare the primary key from the new set with the old one. When I find the corresponding row, I check all the fields, and if any of them differ from the old one, I write the row to a third set and then write that set to a file.

Right now I use:

    newDS.ReadXml("data.xml");
    oldDS.ReadXml("old.xml");

and then I just find the rows with the corresponding primary key and compare the other fields. It works quite well for small files.
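
The comparison itself is roughly the following (a simplified sketch of what I do now; "ID" stands for whatever primary-key column the user-defined schema actually has):

    // Simplified sketch; "ID" is a placeholder for the user-defined key column.
    DataTable newTable = newDS.Tables[0];
    DataTable oldTable = oldDS.Tables[0];
    oldTable.PrimaryKey = new[] { oldTable.Columns["ID"] };

    DataTable changes = newTable.Clone();       // same columns, no rows

    foreach (DataRow newRow in newTable.Rows)
    {
        DataRow oldRow = oldTable.Rows.Find(newRow["ID"]);
        if (oldRow == null)
        {
            changes.ImportRow(newRow);          // row was added
            continue;
        }

        foreach (DataColumn col in newTable.Columns)
        {
            if (!Equals(newRow[col.ColumnName], oldRow[col.ColumnName]))
            {
                changes.ImportRow(newRow);      // row was modified
                break;
            }
        }
    }

    changes.WriteXml("changes.xml");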

The problem is that my files may be up to about 4GB each. If the new and old data are both that big, it is quite problematic to load 8GB of data into memory.

I would like to load my data in parts, but to compare I need the whole old dataset (or is there a way to get a specific row with the corresponding primary key straight from the XML file?).

Another problem is that I don't know the structure of the XML file in advance. It is defined by the user.

What is the best way to work with such big files? I thought about using LINQ to XML, but I don't know whether it has anything that can help with my problem. Maybe it would be better to leave XML and use something different?

soshman
    _"that I don't know the structure of a XML file. It is defined by user."_ You will at least have to know what a 'row' is. – H H Aug 26 '13 at 08:47
  • Are the elements in the files sorted? If yes, you can use a forward-moving sweep through the files, reading just one element at a time. Even if not, you can read one element at a time from one of the files and hold the other in memory. Look up the SAX API for examples of how you might do this. – Brian O'Byrne Aug 26 '13 at 08:47
  • Is the data sorted in any way? Makes a big difference. – H H Aug 26 '13 at 08:48
  • @BrianO'Byrne - no need for the SAX API, XmlReader will do. – H H Aug 26 '13 at 08:48
  • Possible duplicate of http://stackoverflow.com/questions/5838657/how-can-i-use-linq-to-xml-to-query-huge-xml-files-with-reasonable-memory-consump – Alex Siepman Aug 26 '13 at 08:55
  • @AlexSiepman - closely related, yes. But this question has enough specific aspects to stand on its own. – H H Aug 26 '13 at 09:00
  • @Henk Holterman I only know what is written in the XML file. Maybe I wasn't precise: I know the structure because it's in the XML, but users can define a variety of different datasets. My bad. – soshman Aug 26 '13 at 09:12
  • @Brian O'Byrne The data is not sorted. – soshman Aug 26 '13 at 09:13
  • @Alex Siepman It's similar, but not identical. I asked a few more things connected to my problem. – soshman Aug 26 '13 at 09:15
  • @HenkHolterman (and OP), Then I am looking forward to different answers ;-) – Alex Siepman Aug 26 '13 at 09:26
  • I think the link from Alex Siepman is your best answer. Read the new file forward-only with an XmlReader and, for each element, use the technique described in that answer to look for corresponding elements in the old file (see the sketch after these comments). – Brian O'Byrne Aug 26 '13 at 09:51
  • @BrianO'Byrne, isn't this going to be O(N^2) in the size of the (huge) files? – jwg Aug 26 '13 at 12:29
  • Yes, it is. As is the OP's original algorithm, though this approach will be much, much slower. Such is the nature of a tradeoff that uses less RAM. Another option that could be suggested is to give the work to a database server. Serialize the XML into database tables and run a query to get a diff. The database server should find a query plan with better than O(N^2) and a RAM requirement better than 4GB. – Brian O''Byrne Aug 26 '13 at 12:53
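
For reference, a minimal sketch of the forward-only read suggested in the comments above, assuming each record is a <row> element with a "key" attribute (both names are placeholders, since the real structure is user-defined):

    // Forward-only pass over the new file with XmlReader. "row" and "key"
    // are placeholders for whatever the user-defined schema actually uses.
    using System.Xml;
    using System.Xml.Linq;

    using (XmlReader reader = XmlReader.Create("data.xml"))
    {
        reader.MoveToContent();     // position on the root element
        reader.Read();              // move to its first child
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "row")
            {
                // Materializes just this one record and advances the reader past it.
                XElement newRecord = (XElement)XNode.ReadFrom(reader);
                string key = (string)newRecord.Attribute("key");

                // Look up the matching record from old.xml here, e.g. using the
                // technique from the linked answer or a keyed bucket as in the
                // answer below.
            }
            else
            {
                reader.Read();
            }
        }
    }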

1 Answer


You are absolutely right that you should leave XML. It is not a good tool for datasets this size, especially if the dataset consists of many 'records' all with the same structure. Not only are 4GB files unwieldy, but almost anything you use to load and parse them is going to need even more memory than the size of the file itself.

I would recommend that you look at solutions involving an SQL database, but I have no idea how it can make sense to be analysing a 4GB file where you "don't know the structure [of the file]" because "it is defined by the user". What meaning do you ascribe to 'rows' and 'primary keys' if you don't understand the structure of the file? What do you know about the XML?
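
If you do go down the database road, the rough shape would be: bulk-load both files into staging tables and let the server compute the diff, as suggested in the comments. This is a sketch only, assuming staging tables NewData and OldData already exist with identical columns (the table names and connection string are placeholders):

    // Sketch only: bulk-load both snapshots into staging tables, then let the
    // server compute the diff. NewData/OldData are assumed to already exist
    // with identical columns; all names here are placeholders.
    using System.Data;
    using System.Data.SqlClient;

    using (var conn = new SqlConnection("...your connection string..."))
    {
        conn.Open();

        // WriteToServer also accepts an IDataReader, so in practice you would
        // stream the rows in rather than loading whole DataSets first.
        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "NewData" })
            bulk.WriteToServer(newDS.Tables[0]);
        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "OldData" })
            bulk.WriteToServer(oldDS.Tables[0]);

        // New or changed rows: present in NewData but not identical in OldData.
        // (Deleted rows would be the reverse EXCEPT.)
        using (var cmd = new SqlCommand(
            "SELECT * FROM NewData EXCEPT SELECT * FROM OldData", conn))
        using (SqlDataReader result = cmd.ExecuteReader())
        {
            var changes = new DataTable("row");
            changes.Load(result);
            changes.WriteXml("changes.xml");
        }
    }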

Alternatively, it might make sense to read one file, store all the records with primary keys in a certain range, do the same for the other file, compare that data, then move on to the next range. By segmenting the key space you make sure that you always find matches if they exist. It could also make sense to break your files into smaller chunks in the same way (although I still think XML storage at this size is usually inappropriate). Can you say a little more about the problem?
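
Here is a rough sketch of that bucketing idea, where readNew/readOld stand for some streaming source of (key, record) pairs (for instance an XmlReader loop like the one shown after the question's comments). Every name here is illustrative rather than a drop-in implementation:

    // Illustrative only: diff one key-range "bucket" at a time so only a
    // slice of the old file has to sit in memory. readNew/readOld are assumed
    // to stream (key, record) pairs from the two files.
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Xml.Linq;

    static IEnumerable<XElement> DiffBucket(
        Func<IEnumerable<KeyValuePair<string, XElement>>> readNew,
        Func<IEnumerable<KeyValuePair<string, XElement>>> readOld,
        Func<string, bool> inRange)
    {
        // Hold only the old records whose keys fall into the current range.
        Dictionary<string, XElement> oldBucket = readOld()
            .Where(p => inRange(p.Key))
            .ToDictionary(p => p.Key, p => p.Value);

        foreach (var pair in readNew().Where(p => inRange(p.Key)))
        {
            XElement oldRecord;
            if (!oldBucket.TryGetValue(pair.Key, out oldRecord) ||
                !XNode.DeepEquals(pair.Value, oldRecord))
            {
                yield return pair.Value;    // record is new or has changed
            }
        }
    }

You would call this once per range (split on the first character of the key, hash it modulo N, whatever spreads the keys evenly) and append each bucket's output to the change file. Each call re-reads both files, so it trades extra I/O for a bounded memory footprint.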

jwg
  • XML is a fine format, especially when dealing with different data structures and technologies. – H H Aug 26 '13 at 08:55
  • What does that comment even mean? – jwg Aug 26 '13 at 09:09
  • -1 for "XML is a tool". It's not, it's a file format, and it's both interoperable and easy to serialise into another format which may be faster to search through to answer the OP's original question. – Jay Aug 26 '13 at 09:58
  • Whether it is easy to serialize depends on whether the data was a good fit for XML in the first place. – jwg Aug 26 '13 at 10:15