I have an XML file that is over 1.7 GB in size, but most of the data in it is not relevant. I am looking for a tool or method I can use to delete entries based on the content of one of their descriptor nodes.

Example:

<?xml version="1.0" encoding="UTF-8"?>
<entries>
 <entry>
  <category>11 -1234</category>
  <value>Some String</value>
 </entry>
 <entry>
  <category>15 -5678</category>
  <value>Some String</value>
 </entry>
</entries>

I want to delete all entries whose category starts with 11.

Kevin
  • Which tools do you have available? – Tomalak Nov 23 '16 at 10:30
  • That's the problem: I don't know which tool to start with. I am a novice at most of this. I mostly just do HTML, CSS, and very basic VBA, so Notepad++ has always been sufficient. In this case, the file is too big to even open, so I can't get started. – Kevin Nov 23 '16 at 15:23
  • I see. The most natural choice for modifying XML would be XSLT, the programming language that was made specifically for this purpose. An XSLT program that deletes nodes on a certain condition would be extremely simple ([example](http://stackoverflow.com/q/321860/18771)), but the input file size could be a problem. Without streaming, an XSLT processor needs to load the entire file before it can process it, and 1.7 GB could be close to some limitation. In any case you should try that option first, at the very least to have a reference. Maybe it's even fast enough for your purposes. – Tomalak Nov 23 '16 at 17:08
  • More advanced XSLT processors (for example Saxon, which can be downloaded for Windows) allow "streaming" transformations. These only work for uniform XML files (yours seems to be uniform enough) and do not need to load the entire file before starting their work. This would lead to much shorter execution times and greatly reduced memory consumption. – Tomalak Nov 23 '16 at 17:12
  • The third option would be to create a [SAX](https://msdn.microsoft.com/en-us/library/ms754682(v=vs.85).aspx)-based program (the Windows built-in XML library, MSXML, comes with a SAX parser). Writing a program that deletes nodes when a certain condition is met would not be too hard. You could even do it from VBA ([MSDN article](https://msdn.microsoft.com/en-us/library/ms994312.aspx)), but it's a bit more involved than using XSLT. However, the advantages of "streaming" would apply: a low memory footprint and fast execution. – Tomalak Nov 23 '16 at 17:21
  • In order to test an XSLT program quickly and easily on Windows, you can download the [msxsl.exe](https://www.microsoft.com/en-us/download/details.aspx?id=21714) command line tool, but there are plenty of examples of how to invoke XSLT from VBA as well, if you prefer that. – Tomalak Nov 23 '16 at 17:25
  • I would start like this: Modify the XSLT from the answer linked above so that it deletes the right nodes for you. It should not be too hard, but you will need to learn some XPath basics to do it. When it works with a tiny test file, use msxsl.exe to run it against the original file. If it works and is fast enough for you, you're done. If it doesn't work, look into either switching to a SAX-based approach or to the Saxon XSLT processor with streaming. – Tomalak Nov 23 '16 at 17:28
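
The streaming idea from the comments above can also be sketched outside XSLT and MSXML. The following is a minimal sketch in Python, which is not one of the tools suggested in the thread; it uses the standard library's `xml.etree.ElementTree.iterparse` to read the file incrementally. The function name `filter_entries` and the file names in the usage note are hypothetical. It assumes the structure shown in the example: a root `<entries>` element whose direct children are `<entry>` elements, each with a `<category>` child. Anything else under the root (attributes, comments, other elements) is not copied.

```python
import xml.etree.ElementTree as ET


def filter_entries(source, dest, prefix="11"):
    """Stream <entry> elements from source to dest, skipping those whose
    <category> text starts with prefix. Returns the number of entries kept.

    source: file name or binary file object containing the input XML.
    dest:   binary file object opened for writing.
    """
    dest.write(b'<?xml version="1.0" encoding="UTF-8"?>\n<entries>\n')
    kept = 0

    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element

    for event, elem in context:
        if event != "end" or elem.tag != "entry":
            continue

        category = (elem.findtext("category") or "").strip()
        if not category.startswith(prefix):
            elem.tail = "\n"  # one entry per line in the output
            dest.write(ET.tostring(elem, encoding="unicode").encode("utf-8"))
            kept += 1

        # Drop the processed entry from the tree so memory use stays flat
        # no matter how large the input file is.
        root.clear()

    dest.write(b"</entries>\n")
    return kept
```

Assuming the input is saved as `big.xml` (a placeholder name), this could be run as `with open("big.xml", "rb") as src, open("filtered.xml", "wb") as dst: filter_entries(src, dst)`. As with the XSLT route, it is worth trying it on a tiny test file first.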
