find and trim string endings in XML file

Question

Kinda new at scripting. I'm mostly a C# coder but...

I have an XML file that contains a lot of nodes with repeated names, all of which have ".txt" in the value:

Scan.xml

<Parent Tags>
    ...
    <FileNameWithPath> Some/Path/That/has/file.extension.txt</FileNameWithPath>
    ...
</Parent Tags>
...
<Parent Tags>
    ...
    <FileNameWithPath> Some/NewPath/That/has/Newfile.DifferentExtension.txt</FileNameWithPath>
    ...
</Parent Tags>

I'm trying to write a (bash) script in Linux to remove all the ".txt" substrings within the file.

Testing things out, I have:

cat IpScan.xml | sed -ne '/<FileNameWithPath>/s#\s*<[^>]*>\s*##gp'

but this only displays the value of the tag in the terminal.

I've also tried something like this

grep -oP "<FileNameWithPath>(.*)</FileNameWithPath>" IpScan.xml | cut -d ">" -f 2 | cut -d "<" -f 1

My thinking is to loop through each result of sed or grep and process the end of the string but then I don't know how to write the value back to the file. Also, I'm not sure grep or sed allows you to iterate (??)

My Question is this: How can I open the file, change the value of the element to remove the ".txt" string and save the file with the updated values?

I would prefer not to have to install another package as the Linux box I'm working on does not have network connectivity.

You can use inline-editing of the sed command: sed -i 's#.txt##g' Scan.xml — EnlightMe, Dec 20 '19 at 21:04
[Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest to use an XML/HTML parser (xmlstarlet, xmllint ...). — Cyrus, Dec 20 '19 at 21:37

score 3 · Answer 1 · answered Dec 21 '19 at 01:32

As already mentioned in the comments, it's generally a bad idea to use regexes to manipulate XML files. But you can easily use XSLT for transforming parts of your XML. In the case of changing a single value, xmlstarlet provides a one-line approach:

xmlstarlet ed -u "//Parent_Tags/FileNameWithPath" -x "normalize-space(concat(substring-before(.,'.txt'),substring-after(.,'.txt')))" input.xml

Here:

The ed subcommand puts xmlstarlet in edit mode, meaning that values are edited/changed.
The -u option specifies the XPath of the elements to be updated, like a for-each loop.
The -x option specifies the new value as an XPath expression evaluated relative to the context node selected by the -u option. Here the string before .txt is concatenated with the string after .txt. The normalize-space() function strips leading and trailing whitespace and collapses internal runs of whitespace into a single space.

The updated XML is output to STDOUT and can, of course, be redirected to the new XML file.
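To make the redirection step concrete, here is a minimal sketch. The file names demo_input.xml and demo_output.xml and the Root wrapper element are hypothetical and only serve the illustration; the xmlstarlet invocation itself is the one given above. The output goes to a second file first, because redirecting straight back onto the input would truncate it before it is read. The script skips the edit if xmlstarlet is not installed.

```shell
#!/bin/sh
# Hypothetical sample input; the Root element and file names are illustrative.
cat > demo_input.xml <<'EOF'
<Root>
  <Parent_Tags>
    <FileNameWithPath> Some/Path/That/has/file.extension.txt</FileNameWithPath>
  </Parent_Tags>
</Root>
EOF

if command -v xmlstarlet >/dev/null 2>&1; then
  # Write to a separate file first: redirecting straight back to
  # demo_input.xml would truncate it before xmlstarlet reads it.
  xmlstarlet ed -u "//Parent_Tags/FileNameWithPath" \
    -x "normalize-space(concat(substring-before(.,'.txt'),substring-after(.,'.txt')))" \
    demo_input.xml > demo_output.xml &&
    mv demo_output.xml demo_input.xml
  cat demo_input.xml
else
  echo "xmlstarlet is not installed; skipping the demo" >&2
fi
```

Keeping the original until the edit succeeds (the && before mv) means a failed run leaves the input file untouched.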

score 0 · Accepted Answer · answered Dec 21 '19 at 23:56

Try this simple sed command:

sed 's/\.txt</</' IpScan.xml

explanation:

s/\.txt</</ substitute ".txt<" with "<" once per line
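Since the command above only prints to the terminal, saving the result needs one more step. The following is a rough sketch of an in-place variant, assuming a sed that supports -i with a backup suffix (GNU and BSD sed both do). The file name demo_scan.xml, the Root wrapper, and the choice to anchor on the full closing tag are illustrative assumptions, not part of the answer.

```shell
#!/bin/sh
# Hypothetical sample file mirroring the structure in the question.
cat > demo_scan.xml <<'EOF'
<Root>
  <Parent_Tags>
    <FileNameWithPath> Some/Path/That/has/file.extension.txt</FileNameWithPath>
  </Parent_Tags>
  <Parent_Tags>
    <FileNameWithPath> Some/NewPath/That/has/Newfile.DifferentExtension.txt</FileNameWithPath>
  </Parent_Tags>
</Root>
EOF

# -i.bak edits the file in place and keeps the original as demo_scan.xml.bak.
# The dot is escaped so it matches a literal "."; matching the closing tag
# limits the change to FileNameWithPath values that end in ".txt".
sed -i.bak 's#\.txt</FileNameWithPath>#</FileNameWithPath>#' demo_scan.xml

# Show the updated file.
cat demo_scan.xml
```

Anchoring on the closing tag is a bit stricter than matching ".txt<" alone, which would also touch any other element whose value happens to end in ".txt". The .bak copy can be deleted once the result looks right.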

Dudi Boy

Thank you! This was the only answer that didn't require me to use any other packages. Simple and very clean. I ended up using `sed` with the `-i` option to update the file; but other than that, it's just what I needed. – fifamaniac04 Dec 23 '19 at 14:32
