Match first hit only in an XML with grep

Question

Say I have the simplified XML file below and need to extract everything between the <innerElement> and </innerElement> tags, but only for the Id 1234.

<outerTag>
  <innerElement>
    <Id>1234</Id>
    <fName>Kim</fName>
    <lName>Scott</lName>
<customData1>Value1</customData1>
<customData2>Value2</customData2>
    <position>North</position>
    <title/>
  </innerElement>
  <innerElement>
    <Id>5678</Id>
    <fName>Brian</fName>
    <lName>Davis</lName>
<customData3>value3</customData3>
<customData4>value4</customData4>
<customData5>value5</customData5>
    <position>South</position>
    <title/>
  </innerElement>
</outerTag>

My expected output is:

<innerElement>
    <Id>1234</Id>
    <fName>Kim</fName>
    <lName>Scott</lName>
<customData1>Value1</customData1>
<customData2>Value2</customData2>
    <position>North</position>
    <title/>
</innerElement>

Based on what I have read in other posts, I have tried using grep -z to match multiline strings (treating the file contents as a single line) and -o to print only the matched part, but when I use the .* wildcard after the Id element, it ends up matching everything up to the end of the file instead of stopping at the first occurrence.

grep -zo '<innerElement>.*<Id>1234</Id>.*</innerElement>' myfile.xml

How can I make the pattern match only up to the first occurrence of the closing tag after Id 1234?

Please [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus, Apr 23 '23 at 16:02
@Cyrus thanks, I read the post; I didn't know about the craziness of using regex for XML/HTML parsing. I will look into those parsers instead. – aldehc99, Apr 23 '23 at 16:27
Avoid modifying the input file without any communication when you get answers, thanks – Gilles Quénot, Apr 23 '23 at 19:54

Gilles Quénot · Accepted Answer · 2023-04-24T00:37:52.480


Don't use sed or regex to parse XML. You cannot, and must not, parse structured text like XML/HTML with tools designed to process raw text lines. If you need to process XML/HTML, use an XML/HTML parser. The great majority of languages have built-in support for parsing XML, and there are dedicated tools like xidel, xmlstarlet or xmllint if you need a quick shot from a command-line shell.

With a proper `XML` parser

Using `xidel`

xidel --xml -e '//innerElement[Id="1234"]' file.xml

Using `xmlstarlet`

xmlstarlet sel -t -c '//innerElement[Id="1234"]' file.xml

Using `xmllint`

xmllint --xpath '//innerElement[Id="1234"]' file.xml

Output

<innerElement>
    <Id>1234</Id>
    <fName>Kim</fName>
    <lName>Scott</lName>
    <customData1>Value1</customData1>
    <customData2>Value2</customData2>
    <position>North</position>
    <title/>
  </innerElement>
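
Using a language's built-in XML support

As a rough illustration of the "built-in support" mentioned above, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The sample XML is the one from the question, inlined so the snippet runs on its own; the helper name `find_inner_element` is made up for this example. ElementTree only implements a subset of XPath, but the `[Id='1234']` child-text predicate is part of that subset.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, inlined so the sketch is self-contained.
XML = """<outerTag>
  <innerElement>
    <Id>1234</Id>
    <fName>Kim</fName>
    <lName>Scott</lName>
<customData1>Value1</customData1>
<customData2>Value2</customData2>
    <position>North</position>
    <title/>
  </innerElement>
  <innerElement>
    <Id>5678</Id>
    <fName>Brian</fName>
    <lName>Davis</lName>
<customData3>value3</customData3>
<customData4>value4</customData4>
<customData5>value5</customData5>
    <position>South</position>
    <title/>
  </innerElement>
</outerTag>"""


def find_inner_element(xml_text, wanted_id):
    """Return the first <innerElement> whose <Id> child equals wanted_id, or None.

    Hypothetical helper for illustration; wanted_id is assumed to contain no quotes.
    """
    root = ET.fromstring(xml_text)
    # [Id='...'] selects elements that have an <Id> child with exactly that text.
    return root.find(f".//innerElement[Id='{wanted_id}']")


if __name__ == "__main__":
    node = find_inner_element(XML, "1234")
    if node is not None:
        # Serialize the matched element (and only that element) back to XML text.
        print(ET.tostring(node, encoding="unicode"))
```

Because the parser works on the element tree rather than on raw text, the match stops at the correct closing tag by construction, which is exactly what the greedy `.*` in the grep attempt could not guarantee.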

edited Apr 24 '23 at 00:37

answered Apr 23 '23 at 15:58

Gilles Quénot



Indeed `xmlstarlet` is the right tool – anubhava Apr 23 '23 at 16:47

I'm not sure I understand your approach to this issue. Your query (or rather queries) do not produce that particular output. And why would you edit the XML file instead of just extracting what's needed? `xidel -s file.xml -e '//innerElement[Id="1234"]' --output-node-format=xml --output-node-indent` – Reino Apr 23 '23 at 19:39
Because I had the nested node with my attempts for whatever reason. – Gilles Quénot Apr 23 '23 at 19:42
**xmlstarlet** worked like a charm for me. I needed to extract only certain nodes based on their IDs from an XML file and place them in another. I used this line ---> `xmlstarlet sel -t -c '//innerElement[./Id[text()="1234"]]' inputFile.xml >> outputFile.xml` – aldehc99 Apr 28 '23 at 16:30
You can use `xmlstarlet edit --help` to move nodes – Gilles Quénot Apr 28 '23 at 16:39
