Remove specific text from an XML like file using sed

Question

I have the following file (which is a JUnit report file) from which I need to remove the system-out and system-err nodes and their content, while preserving the other node structures (elements and values).

My file has the following type of structure and content (please note a system-* element can have multiline content and html like tags):

<testsuite name="someTest" tests="1" skipped="0" failures="0" errors="0">
  <properties/>
  <testcase name="someMethod" classname="classA" time="0.096">
    <system-out><![CDATA[foo <li></li> bar]]></system-out>
    <system-err><![CDATA[[one] INFO two
three four 
five]]></system-err>
  </testcase>
  <system-out><![CDATA[]]></system-out>
  <system-err><![CDATA[]]></system-err>
</testsuite>

The desired result is to have

<testsuite name="someTest" tests="1" skipped="0" failures="0" errors="0">
  <properties/>
  <testcase name="someMethod" classname="classA" time="0.096">
  </testcase>
</testsuite>

I have tried multiple variants of sed patterns; the following is not nice but partially works. The current approach is to use tr to replace newlines with some exotic character (a form feed), apply sed to the resulting one-line text, then use tr again to restore the original newlines (I combined several SO suggestions to get this, and I don't really know how to use the multiple sed -N flag):

tr "\n" "\f" < "$f" |
sed 's/\(<system-err>\)\(.*\)\(<\/system-err>\)/\1\3/' |
sed 's/\(<system-out>\)\(.*\)\(<\/system-out>\)/\1\3/' |
tr "\f" "\n" > "$(basename "$f")-out.xml"

The problem with this is that sed's matching is greedy: for instance, it removes everything from the first system-err to the last one, leaving unclosed elements. I have tried multiple other things, including a pattern such as sed -E 's/<system-out><![(.*)]><\/system-out>//g' to match anything between the system-* tags, but it does not really work.
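
The greediness problem described above can be reproduced outside sed. The following is a minimal Python sketch, not a sed solution, and it is still regex-based, so it shares the fragility of any regex-on-XML approach. The sample string and the names GREEDY and LAZY are made up for the illustration. The greedy `.*` runs from the first opening tag to the last closing tag, while the lazy `.*?` stops at the nearest closing tag; sed's regular expressions offer no lazy quantifier, which is why the pipeline above over-matches.

```python
import re

# Hypothetical sample modelled on the report above: two system-err elements,
# the first one spanning several lines.
sample = (
    '<testcase name="someMethod">\n'
    '  <system-err><![CDATA[[one] INFO two\n'
    'three four\n'
    'five]]></system-err>\n'
    '</testcase>\n'
    '<system-err><![CDATA[]]></system-err>\n'
)

# re.DOTALL lets "." match newlines, playing the role of the tr "\n" "\f" trick.
GREEDY = re.compile(r"<system-err>.*</system-err>", re.DOTALL)
LAZY = re.compile(r"<system-err>.*?</system-err>", re.DOTALL)

greedy_result = GREEDY.sub("", sample)
lazy_result = LAZY.sub("", sample)

# Greedy: a single match from the first <system-err> to the last
# </system-err>, so the </testcase> in between is swallowed.
print("</testcase>" in greedy_result)  # False

# Lazy: each element is matched separately and </testcase> survives.
print("</testcase>" in lazy_result)  # True
```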

I am not a sed or regexp expert, so please be merciful :). My constraint is the need to use sed (inside a bash script).

Could someone please advise how to achieve the removal of the system-out and system-err elements?

Thank you in advance!

[Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest to use an XML/HTML parser (xmlstarlet, xmllint ...). — Cyrus, Jul 08 '21 at 20:17

score 4 · Accepted Answer · answered Jul 08 '21 at 20:21

With xmlstarlet:

xmlstarlet edit --omit-decl --delete '//system-out' --delete '//system-err' file.xml

Output:

<testsuite name="someTest" tests="1" skipped="0" failures="0" errors="0">
  <properties/>
  <testcase name="someMethod" classname="classA" time="0.096"/>
</testsuite>

See: xmlstarlet edit --help

answered Jul 08 '21 at 20:21

Cyrus

This answer is brilliant, nice and short (did the trick). But, although I understand it's not the best idea (to say the least) to use sed for XML/XHTML-like parsing, is there any way to use sed for this? (I understand it might be considered "overkill") – acostache Jul 08 '21 at 20:43

@acostache sed working with xml isn't overkill; it's more like using a screwdriver to pound a nail. – Shawn Jul 08 '21 at 21:36

Although I would have wanted a sed-related answer, this xmlstarlet one works perfectly, so I will mark it as accepted (in addition, I understand it is the recommended practice for dealing with XML file processing). Thank you for the help! – acostache Jul 09 '21 at 09:32

I was confused for a minute because this didn't work for me; the issue was that the document had a namespace. There is a way to specify the namespace that interested readers can look up, but the easiest fix is to use the default namespace: `//_:your-tag` instead of `//your-tag`. – Matthew Read Nov 19 '22 at 05:31
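
The namespace pitfall from the last comment is not specific to xmlstarlet. As a hedged illustration in Python's xml.etree.ElementTree (the namespace URI and the sample document below are invented for the example), a plain tag search finds nothing in a namespaced document, while the `{*}` wildcard, available from Python 3.8, matches the tag in any namespace:

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaced report; the URI is made up for this example.
doc = (
    '<testsuite xmlns="http://example.com/junit" name="someTest">'
    '<testcase name="someMethod"><system-out>foo</system-out></testcase>'
    '</testsuite>'
)
root = ET.fromstring(doc)

# The element's real tag is "{http://example.com/junit}system-out",
# so a search for the bare name finds nothing.
plain = root.findall(".//system-out")

# "{*}" matches the local name in any (or no) namespace (Python 3.8+).
wildcard = root.findall(".//{*}system-out")

print(len(plain), len(wildcard))  # 0 1
```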

score 2 · Answer 2 · answered Jul 10 '21 at 05:47

With sed.

Warning: There is a high probability that it will not work if the file has a slightly different structure.

sed -e '\|<system-out>.*</system-out>|d' \
    -e '\|<system-err>.*</system-err>|d' \
    -e '\|<system-err>|,\|</system-err>|d' file.xml

I switched the address delimiter from /.../ to \|...| so that the slash in the closing tags does not need to be escaped.

Output:

<testsuite name="someTest" tests="1" skipped="0" failures="0" errors="0">
  <properties/>
  <testcase name="someMethod" classname="classA" time="0.096">
  </testcase>
</testsuite>

score 0 · Answer 3 · answered Jul 09 '21 at 12:07 · edited Feb 26 '22 at 02:09

With xidel:

$ xidel -s input.xml -e '
  x:replace-nodes(/,(//system-out,//system-err),())
' --output-node-format=xml --output-node-indent
<testsuite name="someTest" tests="1" skipped="0" failures="0" errors="0">
  <properties/>
  <testcase name="someMethod" classname="classA" time="0.096">
  </testcase>
</testsuite>

edited Feb 26 '22 at 02:09

answered Jul 09 '21 at 12:07

Reino

score 0 · Answer 4 · answered Jul 09 '21 at 13:12

Just for the record, xmlstarlet does not work well with large files (for files of 30+ MB it throws the "huge input lookup" error). But it's brilliant for the small use case in my initial question, so Cyrus's answer did the trick.

If anybody needs something that works for larger files (personally I needed something scalable as well), I found a straightforward Python solution (so no sed here either):

import xml.etree.ElementTree as ET

file = "myJunitReport.xml"
tree = ET.parse(file)
root = tree.getroot()

# remove top level system-out/system-err
for tag in ("system-out", "system-err"):
    for elem in root.findall(tag):
        root.remove(elem)

# remove testcase related system-out/system-err
# (a bare tag matches direct children only, which is what Element.remove()
# requires; a ".//" search could return deeper descendants and make
# remove() raise ValueError)
for testcase in root.findall("testcase"):
    for tag in ("system-out", "system-err"):
        for elem in testcase.findall(tag):
            testcase.remove(elem)

# overwrite the report in place
tree.write(file)

An important part is that I am using Python's built-in xml.etree.ElementTree API. Other libraries, such as lxml.etree, complain about large files as well.

Truly hope this helps someone else struggling with such scenarios.
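
The script above assumes that the root is a testsuite and that system-out/system-err sit directly under it or under a testcase. As a sketch under a different assumption (some JUnit writers wrap suites in a testsuites element, which is not shown in the question), the removal can be written depth-independently. The function name strip_elements and the sample document are hypothetical and only serve the illustration.

```python
import xml.etree.ElementTree as ET


def strip_elements(root, tags=("system-out", "system-err")):
    """Remove every element whose tag is in `tags`, at any depth below root.

    Returns the number of removed elements.
    """
    removed = 0
    # ElementTree elements have no parent pointer, so take a snapshot of all
    # elements first and then drop the matching direct children of each one.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in tags:
                parent.remove(child)
                removed += 1
    return removed


# Hypothetical sample with an extra <testsuites> wrapper level.
sample = """<testsuites><testsuite name="someTest">
  <testcase name="someMethod">
    <system-out><![CDATA[foo <li></li> bar]]></system-out>
    <system-err>x</system-err>
  </testcase>
  <system-out><![CDATA[]]></system-out>
</testsuite></testsuites>"""

root = ET.fromstring(sample)
count = strip_elements(root)
print(count)  # 3
print(ET.tostring(root, encoding="unicode"))
```

For a file on disk, the same function could be applied to ET.parse(file).getroot() before calling tree.write(file), as in the script above.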
