Say I have the below simplified type of XML file and need to extract all of the string data that is within the <innerElement> and </innerElement> tags only for the Id 1234.

<outerTag>
  <innerElement>
    <Id>1234</Id>
    <fName>Kim</fName>
    <lName>Scott</lName>
<customData1>Value1</customData1>
<customData2>Value2</customData2>
    <position>North</position>
    <title/>
  </innerElement>
  <innerElement>
    <Id>5678</Id>
    <fName>Brian</fName>
    <lName>Davis</lName>
<customData3>value3</customData3>
<customData4>value4</customData4>
<customData5>value5</customData5>
    <position>South</position>
    <title/>
  </innerElement>
</outerTag>

My expected output is:

<innerElement>
    <Id>1234</Id>
    <fName>Kim</fName>
    <lName>Scott</lName>
<customData1>Value1</customData1>
<customData2>Value2</customData2>
    <position>North</position>
    <title/>
</innerElement>

Based on what I have read in other posts, I tried grep -z to match multiline strings (treating the file contents as a single line) and -o to print only the matched text. However, when I use the .* wildcard after the Id element, it ends up matching everything up to the end of the file instead of stopping at the first occurrence.

grep -zo '<innerElement>.*<Id>1234</Id>.*</innerElement>' myfile.xml

How can I make the pattern match only up to the first closing </innerElement> tag after Id 1234?

Gilles Quénot
aldehc99
  • Please [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Apr 23 '23 at 16:02
  • @Cyrus thanks, read the post, didn't know about the craziness of using regex for XML/html parsing. Will get through those parsers instead. – aldehc99 Apr 23 '23 at 16:27
  • Please avoid modifying the input file in the question without saying so once you have received answers, thanks – Gilles Quénot Apr 23 '23 at 19:54

1 Answer

Don't use sed or regex to parse XML: you cannot, and must not, parse structured text like XML/HTML with tools designed to process raw text lines. If you need to process XML/HTML, use an XML/HTML parser. The great majority of languages have built-in support for parsing XML, and there are dedicated tools like xidel, xmlstarlet or xmllint if you need a quick shot from a command-line shell.
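To illustrate the point about built-in language support, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. The function name extract_inner_element and the shortened SAMPLE document are made up for this example and are not part of the question. ElementTree only supports a limited XPath subset, but a child-text predicate like [Id='1234'] is within it.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's input (customData elements omitted); illustrative only.
SAMPLE = """<outerTag>
  <innerElement>
    <Id>1234</Id>
    <fName>Kim</fName>
    <lName>Scott</lName>
    <position>North</position>
    <title/>
  </innerElement>
  <innerElement>
    <Id>5678</Id>
    <fName>Brian</fName>
    <lName>Davis</lName>
    <position>South</position>
    <title/>
  </innerElement>
</outerTag>"""


def extract_inner_element(xml_text, wanted_id):
    """Return the serialized <innerElement> whose <Id> child equals wanted_id, or None.

    Hypothetical helper for this sketch. wanted_id is interpolated into the
    path expression, so it is assumed not to contain quote characters.
    """
    root = ET.fromstring(xml_text)
    # Select the first innerElement that has an Id child with exactly this text.
    match = root.find(f".//innerElement[Id='{wanted_id}']")
    if match is None:
        return None
    return ET.tostring(match, encoding="unicode")


if __name__ == "__main__":
    print(extract_inner_element(SAMPLE, "1234"))
```

Because the parser works on the element tree rather than on raw text, the match stops at the correct closing tag no matter how many innerElement blocks follow. The serialization may differ cosmetically from the input (for example in how empty elements are written).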

With a proper XML parser

Using xidel

xidel --xml -e '//innerElement[Id="1234"]' file.xml

Using xmlstarlet

xmlstarlet sel -t -c '//innerElement[Id="1234"]' file.xml

Using xmllint

xmllint --xpath '//innerElement[Id="1234"]' file.xml

Output

<innerElement>
    <Id>1234</Id>
    <fName>Kim</fName>
    <lName>Scott</lName>
    <customData1>Value1</customData1>
    <customData2>Value2</customData2>
    <position>North</position>
    <title/>
  </innerElement>
Gilles Quénot
  • Indeed `xmlstarlet` is the right tool – anubhava Apr 23 '23 at 16:47
  • I'm not sure I understand your approach to this issue. Your query (or rather queries) does not produce that particular output. And why would you edit the XML file instead of just extracting what's needed? `xidel -s file.xml -e '//innerElement[Id="1234"]' --output-node-format=xml --output-node-indent` – Reino Apr 23 '23 at 19:39
  • Because I had the nested node with my attempts for whatever reason. – Gilles Quénot Apr 23 '23 at 19:42
  • **xmlstarlet** worked like a charm for me. I needed to extract only certain nodes based on their IDs from an XML file and place them in another. I used this line: `xmlstarlet sel -t -c '//innerElement[./Id[text()="1234"]]' inputFile.xml >> outputFile.xml` – aldehc99 Apr 28 '23 at 16:30
  • You can use `xmlstarlet edit --help` to move nodes – Gilles Quénot Apr 28 '23 at 16:39