Extract text from multiline content using sed

Question

How can I extract the content between the CDATA markers in the example below using sed (or another easy method)?

The tricky part is that the pattern must be evaluated across multiple lines, and part of a matched line must be kept in the extracted result. I expected powerful tools like sed or awk to be able to extract content from a file using a capturing regular expression, but so far without success.

Example of content:

<XmlBox className="com.example.ConfigData">
<xmlString><![CDATA[<ConfigData>
<myField>Here we go:

Yup.
</myField>
</ConfigData>]]></xmlString>
</XmlBox>

<XmlBox className="com.example.ServiceDefinition">
<xmlString><![CDATA[<ServiceDefinition>
<name>Tricky?</name>
</ServiceDefinition>]]></xmlString>
</XmlBox>

Expected result:

<ConfigData>
<myField>Here we go:

Yup.
</myField>
</ConfigData>

<ServiceDefinition>
<name>Tricky?</name>
</ServiceDefinition>

The related regular expression to capture it would be:

(?s)<XmlBox className=".+?">\s+<xmlString><!\[CDATA\[(.+?)\]\]></xmlString>\s+</XmlBox>

But how can this be automated in a simple bash command? I thought it would be easy.

I would highly suggest using a program that is meant to deal with XML to parse XML, like `xmllint` or `xml_grep`. — JNevill, Mar 25 '20 at 19:16
yes, but I don't want to rely on a "valid" xml parser, here it is just a matter of extracting a captured group, or doing substring between markers... no big deal, right ? btw, this would be useful for other needs, but thanks for the hint. — Donatello, Mar 25 '20 at 19:21
No big deal. Like [parsing html with regex](https://stackoverflow.com/a/1732454/2221001) it's a fine idea. — JNevill, Mar 25 '20 at 19:26
99.999% of my use cases should work here... so I don't care :) — Donatello, Mar 25 '20 at 19:35

score 2 · Answer 1 · answered Mar 25 '20 at 19:30

As mentioned in the comments, this is a terrible idea. But, if you want to shoot yourself in the foot:

perl -0777 -pe 's/<XmlBox className=".*">\s+<xmlString><\!\[CDATA\[([^]]*)\]\]><\/xmlString>\s*<\/XmlBox>/$1/g' input

William Pursell

Note that we could also use the exact same regexp as provided, by using `perl -00 -pE 's|CAPTURING_REGEXP|$1|' input.xml`. `-E` for full featured regexp and `|` for sed-like separator to not conflict with regexp. Great. Thanks a lot ! – Donatello Mar 25 '20 at 19:56

score 1 · Answer 2 · answered Mar 25 '20 at 22:37

Sed is awkward with multiline data. As others have mentioned, it's not a great tool for this job, but if that's what you really want, use tr to remove the newlines and then add them back in, e.g.

cat myfile | tr '\n' '\007' |sed 's/fromwhatever/towhatever/'

Then use tr to put the newlines back in. In the example above, octal 7 is the bell character, which presumably doesn't occur in your data (you can use any character that's not already present).
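
For illustration, the same placeholder trick can be sketched in Python. This is only a minimal sketch, not the answerer's code: the helper name `sub_across_lines` and the sample string are hypothetical, and the bell character is assumed to be absent from the data.

```python
import re

BELL = "\x07"  # placeholder assumed not to occur in the data (octal 007)


def sub_across_lines(text, pattern, repl):
    """Hypothetical helper: run a single-line substitution across newlines."""
    if BELL in text:
        raise ValueError("placeholder already present in the data")
    flat = text.replace("\n", BELL)      # like: tr '\n' '\007'
    flat = re.sub(pattern, repl, flat)   # like sed 's/from/to/', but replaces every match
    return flat.replace(BELL, "\n")      # like: tr '\007' '\n'


# Hypothetical sample: '.' can now cross the former line break,
# because the newline has been swapped for the placeholder.
sample = "<a><![CDATA[one\ntwo]]></a>\n"
result = sub_across_lines(sample, r"<a><!\[CDATA\[(.*?)\]\]></a>", r"\1")
print(result)
```

The guard against an existing bell character matters: if the placeholder already occurs in the input, the final step would turn it into a spurious newline.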

score 0 · Answer 3 · answered Mar 25 '20 at 19:41

Another pretty simple solution:

grep -ozP '(?s)<XmlBox className=".+?">\s+<xmlString><!\[CDATA\[\K.+?(?=\]\]></xmlString>\s+</XmlBox>)' data.xml

`\K` discards the characters matched so far, so they are left out of the printed match, and the positive lookahead `(?=matchAfter)` asserts that the match must be followed by the `matchAfter` expression without including it in the output.

Thanks to https://stackoverflow.com/a/28060342/1034782
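
Python's standard `re` module does not support `\K`, so a rough equivalent there is to leave the prefix outside a capturing group and keep the lookahead for the suffix. The following is a minimal sketch under that assumption; the function name `extract_cdata` and the sample string are hypothetical.

```python
import re

# The prefix is matched but not captured (standing in for \K);
# the lookahead checks the suffix without consuming it.
PATTERN = re.compile(
    r'<XmlBox className=".+?">\s+<xmlString><!\[CDATA\['
    r'(.+?)'
    r'(?=\]\]></xmlString>\s+</XmlBox>)',
    re.DOTALL,  # same effect as the inline (?s) flag
)


def extract_cdata(text):
    """Return the list of CDATA payloads found in text."""
    return PATTERN.findall(text)


sample = (
    '<XmlBox className="com.example.ServiceDefinition">\n'
    '<xmlString><![CDATA[<ServiceDefinition>\n'
    '<name>Tricky?</name>\n'
    '</ServiceDefinition>]]></xmlString>\n'
    '</XmlBox>\n'
)
print(extract_cdata(sample))
```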

score 0 · Answer 4 · answered Mar 25 '20 at 21:59

The best solution I found uses Python.

Write a (very) few lines of code in replace.py:

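#!/usr/bin/env python3
import re
import sys

# config: input path, regular expression, replacement string
path = sys.argv[1]
find = sys.argv[2]
repl = sys.argv[3]

# run: read the whole file at once so the pattern can span lines
with open(path, "r") as myfile:
    s = myfile.read()
print(re.sub(find, repl, s))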

Call it as below:

./replace.py input.xml 'CAPTURING_REGEXP' '\1' > output.xml
./replace.py input.xml '(?s)<XmlBox className=".+?">\s+<xmlString><!\[CDATA\[(.+?)\]\]></xmlString>\s+</XmlBox>' '\1' > output.xml

It does exactly what it is supposed to do (no drawback) and is surprisingly fast (10s to process a 750MB file).

Thanks to @kpie's answer for the hint.

Doesn't work if the pattern spans a line unless you use re.DOTALL — Jay, Jul 01 '23 at 13:36
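
Regarding the comment above: the pattern passed on the command line already begins with `(?s)`, which is the inline form of `re.DOTALL`. A small sketch, using a hypothetical sample string and a shortened pattern, shows the difference:

```python
import re

# Hypothetical sample with a line break inside the CDATA section.
text = "<![CDATA[line one\nline two]]>"

plain = r"<!\[CDATA\[(.+?)\]\]>"
inline = r"(?s)" + plain  # inline flag, as in the command line above

no_flag = re.search(plain, text)                 # '.' stops at the newline
with_inline = re.search(inline, text)            # '.' also matches '\n'
with_dotall = re.search(plain, text, re.DOTALL)  # same effect via the flag argument

print(no_flag)
print(with_inline.group(1) == with_dotall.group(1))
```

Without either form of the flag, the search fails as soon as the captured text contains a newline, which is the failure the comment describes.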
