I would like to replace the space characters in the text of an XML element, for example:

from:

<UserDescription>
    <userName>Test User 1</userName>
</UserDescription>

to:

<UserDescription>
    <userName>Test_User_1</userName>
</UserDescription>

I would prefer to use sed, as it is the most feasible option at the moment.

I would appreciate any suggestions or feedback. Thanks.

jmachdy
  • Hello and welcome to StackOverflow. Please take some time to read the help page, especially the sections named ["What topics can I ask about here?"](http://stackoverflow.com/help/on-topic) and ["What types of questions should I avoid asking?"](http://stackoverflow.com/help/dont-ask). And more importantly, please read [the Stack Overflow question checklist](http://meta.stackexchange.com/q/156810/204922). You might also want to learn about [Minimal, Complete, and Verifiable Examples](http://stackoverflow.com/help/mcve). – Clijsters Mar 15 '18 at 15:08
  • If you don't have XMLStarlet, I'd suggest using one of the excellent XML modules in the Python standard library instead -- Python being very widely deployed, and thus available just about everywhere. – Charles Duffy Mar 15 '18 at 15:21
  • To go into just a little detail about why `sed` is the wrong thing -- XML syntax is complicated. `sed` has no feasible way to ignore things that look like tags but are in comments; things that look like tags but are in CDATA sections; macros added through DTD inclusion, stray newlines that don't change the parse; namespace remappings... etc, etc, etc. – Charles Duffy Mar 15 '18 at 15:23
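As a rough illustration of the standard-library route suggested in the comments, here is a minimal sketch using `xml.etree.ElementTree`. The function name `underscore_user_names` and the sample document (including its comment) are made up for this example; only the `userName` element comes from the question.

```python
import xml.etree.ElementTree as ET


def underscore_user_names(xml_text: str) -> str:
    """Return xml_text with spaces in every <userName> text replaced by underscores."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so <userName> elements at any depth are found
    for node in root.iter("userName"):
        if node.text:  # skip empty elements, whose text is None
            node.text = node.text.replace(" ", "_")
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample: the comment contains something that looks like a tag
sample = """<UserDescription>
  <!-- <userName>Not A Real Tag</userName> -->
  <userName>Test User 1</userName>
</UserDescription>"""

print(underscore_user_names(sample))
```

Only the real element is rewritten, because the parser never treats the comment as an element. Note that ElementTree's default parser discards comments while parsing, so the comment does not appear in the output at all; if comments must survive the round trip, a different parser configuration or library would be needed.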

3 Answers

Don't parse HTML with regex; use a proper XML/HTML parser.

Theory:

According to compiler theory, HTML can't be parsed with regular expressions, which are based on finite-state machines. Because HTML is hierarchically nested, you need a pushdown automaton and a grammar-based parser, for example an LALR grammar processed by a tool such as YACC.
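The nesting problem can be seen in a small sketch. Python's `re` module stands in here for the kind of regular expressions `sed` offers, and the `<group>`/`<name>` document is invented for the example. The pattern has no notion of depth, so it pairs the outer opening tag with the inner closing tag, while a parser reports the real structure.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document with a <group> nested inside another <group>
nested = "<group><group><name>a b</name></group><name>c d</name></group>"

# The non-greedy match stops at the first closing tag it sees,
# which belongs to the inner <group>, not the outer one.
match = re.search(r"<group>(.*?)</group>", nested)
regex_body = match.group(1)

# A parser tracks nesting, so the outer element has exactly two children.
root = ET.fromstring(nested)
child_tags = [child.tag for child in root]

print(regex_body)  # <group><name>a b</name>
print(child_tags)  # ['group', 'name']
```

A greedy pattern fails in the opposite direction as soon as two sibling elements share a name, which is why counting depth needs a real parser.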

realLife©®™ everyday command-line tools:

You can use one of the following :

xmllint

xmlstarlet

saxon-lint (my own project)


Check: Using regular expressions with HTML tags


Example using xmlstarlet:

# "." is the node being updated, so each userName is translated from its own text
xmlstarlet edit -L -u '//userName' \
  -x 'translate(., " ", "_")' file.xml

Output :

$ cat file.xml
<?xml version="1.0"?>
<UserDescription>
  <userName>Test_User_1</userName>
</UserDescription>
Gilles Quénot

Using Python and lxml (for fun):

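from lxml import etree

my_xml = 'file.xml'
tree = etree.parse(my_xml)

# Replace spaces in every <userName> element, not only the first one
for node in tree.xpath('//userName'):
    if node.text:  # skip empty elements, whose text is None
        node.text = node.text.replace(' ', '_')
        print(node.text)

# Write the modified tree back to the same file
tree.write(my_xml, pretty_print=True)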

Output :

$ cat file.xml
<?xml version="1.0"?>
<UserDescription>
  <userName>Test_User_1</userName>
</UserDescription>
Gilles Quénot

Using Perl, also for fun:

#!/usr/bin/env perl
# edit file.xml file and save new one in new.xml
use strict; use warnings;

use XML::LibXML;

my $xl = XML::LibXML->new();
my $xml = $xl->load_xml(location => 'file.xml') ;

for my $node ($xml->findnodes('//userName/text()')) {
    my $value = $node->data;   # current content of the text node
    print "$value\n";
    $value =~ s/\s+/_/g;       # each run of whitespace becomes one underscore
    $node->setData($value);
}

$xml->toFile('new.xml');
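
One detail worth knowing about the substitution above: `\s+` matches any run of whitespace, so it behaves differently from a per-character replacement such as XPath `translate()` when the text contains consecutive spaces or tabs. A small sketch of the difference, written in Python with a made-up input string:

```python
import re

name = "Test  User\t1"  # hypothetical input: two spaces, then a tab

# Per-character replacement: every space becomes an underscore, the tab stays
per_space = name.replace(" ", "_")

# Whitespace-run replacement: each run of spaces/tabs becomes one underscore
collapsed = re.sub(r"\s+", "_", name)

print(repr(per_space))  # 'Test__User\t1'
print(repr(collapsed))  # 'Test_User_1'
```

For the single-space input in the question, both approaches give the same result.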
Gilles Quénot