I have a huge XML document (more than 12 GB) that I need to parse in the following way...
Given a structure like this:
<person name="Alice">
<colour>blue</colour>
</person>
<person name="Bob">
<colour>green</colour>
</person>
<person name="Charles">
<colour>blue</colour>
</person>
I would like to extract into a separate file only those person elements that contain the subelement <colour>blue</colour>.
For example, given the previous XML code, the output of the program should be a separate file with the following content:
<person name="Alice">
<colour>blue</colour>
</person>
<person name="Charles">
<colour>blue</colour>
</person>
I have tried to use grep and sed, as they are very useful tools for this purpose and can also handle huge files like mine, but I'm not quite sure which regex I should use.
Thanks in advance!
EDIT: as I have noted, I need a stream-based tool, as otherwise the program simply crashes! I've tried xmlstarlet, but the program is auto-killed (I suppose due to memory use).
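To make "stream-based" concrete, here is a minimal sketch of the kind of filtering described above, written with Python's standard xml.etree.ElementTree.iterparse. It rests on several assumptions that are not confirmed anywhere in this question: the file is well-formed XML (quoted attribute values), all person elements are direct children of a single root element, and the function name extract_people and the file names are purely illustrative. It is a sketch of the idea, not a tested solution for a 12 GB file.

```python
import xml.etree.ElementTree as ET


def extract_people(src, dst, wanted="blue"):
    """Stream `src` and write <person> elements whose <colour> is `wanted` to `dst`.

    Hypothetical helper: assumes <person> elements sit directly under one root.
    Returns the number of elements written.
    """
    count = 0
    context = ET.iterparse(src, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    with open(dst, "w", encoding="utf-8") as out:
        for event, elem in context:
            if event == "end" and elem.tag == "person":
                colour = elem.findtext("colour", default="")
                if colour.strip() == wanted:
                    elem.tail = None  # do not copy whitespace that follows the element
                    out.write(ET.tostring(elem, encoding="unicode") + "\n")
                    count += 1
                # Detach already processed children from the root so that
                # memory use does not grow with the size of the file.
                root.clear()
    return count
```

Called as extract_people("people.xml", "blue_people.xml") (both file names are placeholders), this would write the matching elements one after another without a surrounding root element, which mirrors the expected output shown above.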
EDIT2: I also tried to split the file using xml_split, but the number of subfiles generated is simply unmanageable. So, any suggestions?