Extract XML elements based on inner content

Question

I have a huge XML document (more than 12 GB) that I need to parse in the following way...

Given a structure like this:

<person name=Alice>
   <colour>blue</colour>
</person>

<person name=Bob>
   <colour>green</colour>
</person>

<person name=Charles>
   <colour>blue</colour>
</person>

I would like to extract into a separate file only those person elements that contain the child element <colour>blue</colour>.

For example, given the previous XML code, the output of the program should be a separate file with the following content:

<person name=Alice>
   <colour>blue</colour>
</person>

<person name=Charles>
   <colour>blue</colour>
</person>

I have tried grep and sed, since they can handle huge files like mine, but I'm not sure which regex I should use.

Thanks in advance!

EDIT: as noted in the comments, I need a stream-based tool, as otherwise the program simply crashes! I've tried xmlstarlet, but the process is killed automatically (I suppose due to memory use).

EDIT2: I also tried to split the file using xml_split, but the number of subfiles generated is simply unmanageable. So, any suggestions?

[Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). — Cyrus, Mar 18 '20 at 18:59
@JackFleeting is it stream-based? The problem is that, if not, the program is killed automatically. — Antonio, Mar 19 '20 at 16:53

score 0 · Answer 1 · answered Mar 19 '20 at 22:42

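Since none of the XML-aware tools you've tried so far work for you, and provided your input really is as simple and regular as what you posted (each person element separated from the next by a blank line), awk's paragraph mode (RS set to the empty string) will read one blank-line-delimited block at a time and print the matching ones: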

$ awk -v RS= -v ORS='\n\n' '/<colour>blue</' file
<person name=Alice>
   <colour>blue</colour>
</person>

<person name=Charles>
   <colour>blue</colour>
</person>

If that's NOT all you need, then edit your question to provide more truly representative sample input/output.
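
The one-liner above depends on blank lines between the person elements. If the real file does not have them, a rough variant is to track the start and end tags instead, still holding only one element in memory at a time. This is a sketch under assumptions that are not confirmed by the question: each <person ...> start tag and </person> end tag sits on a line without other person tags, person elements are not nested, and <colour>...</colour> does not span lines. The function name filter_blue is made up for illustration, and this is still pattern matching, not real XML parsing.

```shell
# Hypothetical helper (name is an assumption): print every <person>...</person>
# block that contains <colour>blue</colour>, buffering one block at a time.
filter_blue() {
    awk '
        /<person[ >]/ { buf = ""; inside = 1 }     # start tag: begin a fresh buffer
        inside        { buf = buf $0 "\n" }        # collect lines of the current element
        /<\/person>/ && inside {                   # end tag: decide whether to print
            if (buf ~ /<colour>[ \t]*blue[ \t]*<\/colour>/)
                printf "%s\n", buf                 # the block followed by a blank line
            inside = 0; buf = ""
        }
    ' "$@"
}
```

It would be called as `filter_blue file > blue_persons.xml` (the output file name is only an example). Lines outside any person element are dropped, and optional spaces or tabs around "blue" are tolerated.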
