
How can I extract the content inside the CDATA sections in the example below, using sed (or another easy method)?

The tricky part is that the pattern must be evaluated across multiple lines, and part of the matched lines must be kept in the extracted result. I expected powerful tools like sed or awk to be able to extract content from a file using a capturing regular expression, but had no success.

Example of content:

<XmlBox className="com.example.ConfigData">
<xmlString><![CDATA[<ConfigData>
<myField>Here we go:

Yup.
</myField>
</ConfigData>]]></xmlString>
</XmlBox>

<XmlBox className="com.example.ServiceDefinition">
<xmlString><![CDATA[<ServiceDefinition>
<name>Tricky?</name>
</ServiceDefinition>]]></xmlString>
</XmlBox>

Expected result:

<ConfigData>
<myField>Here we go:

Yup.
</myField>
</ConfigData>

<ServiceDefinition>
<name>Tricky?</name>
</ServiceDefinition>

The related regular expression to capture it would be:

(?s)<XmlBox className=".+?">\s+<xmlString><!\[CDATA\[(.+?)\]\]></xmlString>\s+</XmlBox>

But how can I automate this in a simple bash command? I thought it would be easy.

Donatello
  • I would highly suggest using a program that is meant to deal with xml to parse xml. Like `xmllint` or `xml_grep`. – JNevill Mar 25 '20 at 19:16
  • yes, but I don't want to rely on a "valid" xml parser, here it is just a matter of extracting a captured group, or doing substring between markers... no big deal, right ? btw, this would be useful for other needs, but thanks for the hint. – Donatello Mar 25 '20 at 19:21
  • No big deal. Like [parsing html with regex](https://stackoverflow.com/a/1732454/2221001) it's a fine idea. – JNevill Mar 25 '20 at 19:26
  • 99.999% of my use cases should work here... so I don't care :) – Donatello Mar 25 '20 at 19:35
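
For comparison, the parser-based route suggested in the comments might look roughly like the sketch below. It uses Python's standard `xml.etree.ElementTree` rather than `xmllint` or `xml_grep`. The function name `extract_xml_strings` and the wrapping `<root>` element are made up for illustration, and the sketch assumes the file is well-formed apart from having several top-level elements (and has no XML declaration).

```python
import xml.etree.ElementTree as ET

def extract_xml_strings(text):
    """Return the text of every <xmlString> element found in `text`."""
    # The sample has several top-level <XmlBox> elements, so wrap them in a
    # made-up <root> element to obtain a single well-formed document.
    root = ET.fromstring("<root>" + text + "</root>")
    # The parser hands back CDATA content as ordinary element text.
    return [node.text for node in root.iter("xmlString")]

sample = """<XmlBox className="com.example.ConfigData">
<xmlString><![CDATA[<ConfigData>
<myField>Here we go:

Yup.
</myField>
</ConfigData>]]></xmlString>
</XmlBox>

<XmlBox className="com.example.ServiceDefinition">
<xmlString><![CDATA[<ServiceDefinition>
<name>Tricky?</name>
</ServiceDefinition>]]></xmlString>
</XmlBox>
"""

for block in extract_xml_strings(sample):
    print(block)
    print()
```

Unlike the regexp approaches, this fails loudly on malformed input, which may or may not be what you want.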

4 Answers


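As mentioned in the comments, parsing XML with a regexp is a terrible idea. But, if you want to shoot yourself in the foot:

perl -0777 -pe 's/<XmlBox className=".*">\s+<xmlString><\!\[CDATA\[([^]]*)\]\]><\/xmlString>\s*<\/XmlBox>/$1/g' input

Here -0777 slurps the whole file, so a blank line inside a CDATA section (as in the first sample block) does not split the record the way paragraph mode would, and /g replaces every block. The [^]]* capture assumes the CDATA content contains no ] character.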
William Pursell
  • Note that the exact regexp from the question can be used as well: `perl -0777 -pE 's|CAPTURING_REGEXP|$1|g' input.xml`. `-0777` reads the whole file at once, `-E` works like `-e` (it only enables optional language features such as `say`), and `|` as a sed-like separator avoids conflicts with the `/` in the regexp. Great. Thanks a lot! – Donatello Mar 25 '20 at 19:56

Sed is awkward with multiline data. As others have mentioned, it's not a great tool for this job, but if that's what you really want, use tr to remove the newlines and then add them back in, e.g.

cat myfile | tr '\n' '\007' |sed 's/fromwhatever/towhatever/'

Then use tr to put the newlines back in (tr '\007' '\n'). In the example above, octal 7 is a bell, which presumably doesn't occur in your data; you can use any character that's not already present.
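
The same three steps (fold the newlines into a placeholder, substitute on one long line, unfold) can be sketched in Python to make the idea concrete. This is only an illustration of the trick, not a replacement for the tr/sed pipeline; `replace_across_lines` and `PLACEHOLDER` are hypothetical names, and the `<a>` tags in the demo are made-up data.

```python
import re

PLACEHOLDER = "\a"  # the bell character (octal 007), assumed absent from the data

def replace_across_lines(text, pattern, repl):
    """Apply a line-oriented regexp substitution to multi-line text."""
    if PLACEHOLDER in text:
        raise ValueError("placeholder already occurs in the input")
    # Step 1: fold the text onto a single line, like tr '\n' '\007'.
    flat = text.replace("\n", PLACEHOLDER)
    # Step 2: run the substitution on that single line, like sed 's/.../.../g'.
    flat = re.sub(pattern, repl, flat)
    # Step 3: restore the newlines, like tr '\007' '\n'.
    return flat.replace(PLACEHOLDER, "\n")

demo = "<a>one\ntwo</a>\n<a>three</a>\n"
print(replace_across_lines(demo, r"<a>(.*?)</a>", r"\1"), end="")
```

The explicit check for the placeholder guards against the one assumption the trick relies on: that the chosen character never appears in the input.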

Kyle Banerjee

Another pretty simple solution:

grep -ozP '(?s)<XmlBox className=".+?">\s+<xmlString><!\[CDATA\[\K.+?(?=\]\]></xmlString>\s+</XmlBox>)' data.xml
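  • \K drops everything matched before it from the reported match, so only the text that follows is printed.
  • the positive lookahead (?=matchAfter) asserts that the match must be followed by the matchAfter expression, without including it in the output.
  • with -z, grep treats the whole file as a single NUL-terminated record, and each printed match also ends with a NUL byte rather than a newline; pipe the output through tr '\0' '\n' if that matters.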
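
For readers who want to port this pattern elsewhere: Python's built-in `re` module has no `\K`, so a capturing group has to take its place, while the lookahead carries over unchanged. A minimal sketch, in which the name `cdata_blocks` is made up for illustration:

```python
import re

# Same pattern as the grep command, except that \K is replaced by a
# capturing group; re.DOTALL plays the role of the leading (?s).
PATTERN = re.compile(
    r'<XmlBox className=".+?">\s+<xmlString><!\[CDATA\['
    r'(.+?)'
    r'(?=\]\]></xmlString>\s+</XmlBox>)',
    re.DOTALL,
)

def cdata_blocks(text):
    """Return the CDATA payload of every <XmlBox> found in `text`."""
    return [match.group(1) for match in PATTERN.finditer(text)]

sample = (
    '<XmlBox className="com.example.ServiceDefinition">\n'
    '<xmlString><![CDATA[<ServiceDefinition>\n'
    '<name>Tricky?</name>\n'
    '</ServiceDefinition>]]></xmlString>\n'
    '</XmlBox>\n'
)
print(cdata_blocks(sample)[0])
```

Because the lookahead consumes nothing, the search for the next block resumes right after the captured text.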

Thanks to https://stackoverflow.com/a/28060342/1034782

Donatello

The best solution I found uses Python.

Put these few lines of code in replace.py:

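#!/usr/bin/env python3
import re
import sys

# config: input file, regexp to search for, replacement string
path = sys.argv[1]
find = sys.argv[2]
repl = sys.argv[3]

# run: read the whole file into memory, then print the substituted text
with open(path, "r") as myfile:
    s = myfile.read()
print(re.sub(find, repl, s))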

Call it as below:

./replace.py input.xml 'CAPTURING_REGEXP' '\1' > output.xml
./replace.py input.xml '(?s)<XmlBox className=".+?">\s+<xmlString><!\[CDATA\[(.+?)\]\]></xmlString>\s+</XmlBox>' '\1' > output.xml

It does exactly what it is supposed to do and is surprisingly fast (10 s to process a 750 MB file). The only cost is that it reads the whole file into memory.
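
If holding a file of that size in memory is a concern, one possible variation is to memory-map the file and let `re` scan the mapping directly. This is a sketch under assumptions, not a measured improvement: it works on bytes rather than text, returns only the captured groups instead of substituting, and `extract_cdata` is a hypothetical name. An empty file cannot be mapped, so that case would need separate handling.

```python
import mmap
import re
import sys

# Bytes version of the capturing regexp from the question;
# re.DOTALL stands in for the leading (?s).
PATTERN = re.compile(
    rb'<XmlBox className=".+?">\s+<xmlString><!\[CDATA\[(.+?)\]\]></xmlString>\s+</XmlBox>',
    re.DOTALL,
)

def extract_cdata(path):
    """Return the CDATA payload of each <XmlBox> in the file, as bytes."""
    with open(path, "rb") as handle:
        # Map the file read-only instead of reading it into a Python string.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return [match.group(1) for match in PATTERN.finditer(view)]

if __name__ == "__main__" and len(sys.argv) > 1:
    for payload in extract_cdata(sys.argv[1]):
        sys.stdout.buffer.write(payload + b"\n\n")
```

It would be called as `python3 extract.py input.xml > output.xml`, where `extract.py` is whatever name the script is saved under.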

Thanks to @kpie's answer for the hint.

Donatello