how to replace character in html attribute value (shell / bash)?

Question

Sorry for the stupid question, but I have been stuck all afternoon with this simple problem. So I have a sample text file containing:

<product productId="123456" description="good apple, very green" publicPriceTTC="5,07" brand-id="152" />
<product productId="123457" description="fresh orange, very juicy" publicPriceTTC="12,47" brand-id="153" />
<product productId="123458" description="big banana, very yellow" publicPriceTTC="5,07" brand-id="154" />

And I'd like to modify this file into:

<product productId="123456" description="good apple, very green" publicPriceTTC="5.07" brand-id="152" />
<product productId="123457" description="fresh orange, very juicy" publicPriceTTC="12.47" brand-id="153" />
<product productId="123458" description="big banana, very yellow" publicPriceTTC="5.07" brand-id="154" />

Basically, I need to replace the "," (comma) by a "." (point) in all values of "publicPriceTTC". The trick here is that other attributes might have commas in their values ("description" in this example). I guess sed or awk can do that but I was unable to achieve it.

Can someone help me? Thank you very much for any help.

score 4 · Accepted Answer · answered Sep 30 '17 at 22:32

If you search for any comma and replace it with a point, you will be doing a very coarse search/replace. Try something more specific. With sed, assuming your input file is called xml:

sed -E 's/(publicPriceTTC="[0-9]+),([0-9]+")/\1.\2/' xml

You probably know that sed has the command s/<what you search>/<replacement>. We use that.

The -E option enables extended regular expressions. With that, the s expression matches the attribute name, the "=", and the number within quotes, and the parentheses capture the parts of the match that should be reused in the replacement. \1 stands for the text matched by the first parenthesized group; \2 for the second.

You could of course make the search more robust, for example to cope with whitespace between the attribute name and the equal sign.
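
As a rough sketch of that idea, the pattern below tolerates optional whitespace around the equal sign. The `[[:space:]]*` pieces and the sample line are additions for illustration, not part of the command above, and the pattern still assumes double-quoted values with a single decimal comma.

```shell
# Sample line with extra spaces around "=" (made up for illustration)
line='<product productId="1" description="good apple, very green" publicPriceTTC = "5,07" brand-id="152" />'

# Allow optional whitespace on both sides of "=" before the quoted number
result=$(printf '%s\n' "$line" |
  sed -E 's/(publicPriceTTC[[:space:]]*=[[:space:]]*"[0-9]+),([0-9]+")/\1.\2/')

printf '%s\n' "$result"
```

The comma in description is left alone because the match is anchored on the attribute name.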

No problem max. Note that it is more robust to use an XML parser, but if you know the XML file has a certain format, it will be faster to use a search/replace approach. — Javier Elices, Sep 30 '17 at 23:48

ghoti · Answer 2 · 2017-10-01T13:01:12.043

An awk solution to this might be:

awk '/<product/{for(i=1;i<=NF;i++){if($i~/^publicPriceTTC="/)sub(/,/,".",$i)}}1' file.xml

This steps through every whitespace-separated "field" on every <product> line, looking for "words" that begin with the attribute you're trying to modify. If one is found, the first comma in that field is replaced with a period (switch sub() to gsub() if every comma in the value should be replaced).
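
To see the effect, here is a small check that feeds one sample line from the question through the same awk program. The variable names `line` and `result` are only for illustration.

```shell
line='<product productId="123456" description="good apple, very green" publicPriceTTC="5,07" brand-id="152" />'

# Only the field starting with publicPriceTTC=" is touched; the words of the
# description are separate fields that do not match the test
result=$(printf '%s\n' "$line" |
  awk '/<product/{for(i=1;i<=NF;i++){if($i~/^publicPriceTTC="/)sub(/,/,".",$i)}}1')

printf '%s\n' "$result"
```

One side effect to be aware of: when a field is assigned, awk rebuilds the line using single spaces between fields, so leading whitespace and runs of spaces on modified lines are not preserved.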

A simpler awk solution to emulate what others are doing with sed would be nice, except that standard awk does not support backreferences to parenthesized subexpressions in the replacement string (e.g. \1). Gawk supports them in the gensub() function, so the following might suffice:

gawk '{print gensub(/(publicPriceTTC="[0-9]+),/,"\\1.","g")}' file.xml
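
If gawk is not available, something similar can probably be done with the standard match() function, which records where the match starts and how long it is. This is only a sketch, not a tested replacement for gensub(): it handles one publicPriceTTC per line, and the sample line is taken from the question.

```shell
line='<product productId="123457" description="fresh orange, very juicy" publicPriceTTC="12,47" brand-id="153" />'

# match() sets RSTART and RLENGTH; the matched text ends with the comma,
# so the comma sits at position RSTART+RLENGTH-1 and is swapped for "."
result=$(printf '%s\n' "$line" |
  awk 'match($0, /publicPriceTTC="[0-9]+,/) {
         $0 = substr($0, 1, RSTART + RLENGTH - 2) "." substr($0, RSTART + RLENGTH)
       }
       1')

printf '%s\n' "$result"
```

Because the whole line is rebuilt from substrings rather than by assigning a field, the original spacing is kept as-is.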

But ... you are solving the wrong problem here. Tools like sed and awk, which process files based on regular expressions, are not XML parsers. Either Javier's sed solution or my awk solutions could garble things accidentally, or miss certain things that are in perfectly valid XML files. Regex cannot be used to parse XML safely.
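
As a small illustration of the "miss certain things" case, the line below is valid XML but quotes its attribute values with single quotes. The line is a made-up variant of the sample data; the sed command is the one from Javier's answer.

```shell
# Valid XML, but the attribute values use single quotes
line="<product productId='123456' publicPriceTTC='5,07' />"

result=$(printf '%s\n' "$line" |
  sed -E 's/(publicPriceTTC="[0-9]+),([0-9]+")/\1.\2/')

# The line comes back unchanged because the pattern only expects double quotes
printf '%s\n' "$result"
```

sed reports no error here; the price is simply left with its comma, which is easy to overlook in a large file.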

I recommend that you look into using python or perl or ruby or php or some other language with native XML support.

For example, turning your input into actual XML like this:

<p>
<product productId="123456" description="good apple, very green" publicPriceTTC="5,07" brand-id="152" />
<product productId="123457" description="fresh orange, very juicy" publicPriceTTC="12,47" brand-id="153" />
<product productId="123458" description="big banana, very yellow" publicPriceTTC="5,07" brand-id="154" />
</p>

We could run a PHP one-liner:

php -r '$x=new SimpleXMLElement(file_get_contents("file.xml")); foreach($x->product as $p) { $p["publicPriceTTC"]=str_replace(",",".",$p["publicPriceTTC"]); } print $x->asXML();'

Or split out for easier reading (and commenting):

<?php

// Read an XML file into an object
$x=new SimpleXMLElement(file_get_contents("file.xml"));

// Step through the object, fixing attributes as we find them
foreach($x->product as $p) {
  $p["publicPriceTTC"] = str_replace(",",".",$p["publicPriceTTC"]);
}

// Print the result
print $x->asXML();

Thank you very much. I see your solution works also. I know parsing an XML file with sed/awk is a bad idea but I was not even able to edit the document with xmlstarlet (the command only returns "killed"). — max, Sep 30 '17 at 23:37
I've never used xmlstarlet, but if the binary for that program is dumping core or getting killed with some kind of signal, then you may have a bad binary, or a broken installation. There are SO MANY tools that know how to speak xml though... No need to limit yourself to just one. :) — ghoti, Oct 01 '17 at 00:20

score 0 · Answer 3 · answered Oct 01 '17 at 08:27

This will work with GNU sed:

sed 's/\(publicPriceTTC="[0-9]*\),/\1./' fileName

R. Kumar


score 0 · Answer 4 · answered Oct 01 '17 at 18:58

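Here, using sub in awk is enough, as long as publicPriceTTC is always the seventh whitespace-separated field, as it is in the sample input: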

awk '{sub(/,/,".",$7)}1' file
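
One thing worth checking before relying on this: $7 counts whitespace-separated words, so it only lands on the price when the description has exactly four words. The line below is a made-up example with a longer description, just to show what can happen.

```shell
# Made-up line: the description has six words, so field 7 is "five,"
line='<product productId="1" description="one two three four five, six" publicPriceTTC="5,07" brand-id="152" />'

result=$(printf '%s\n' "$line" | awk '{sub(/,/,".",$7)}1')

# The comma in the description is changed and the price is left as it was
printf '%s\n' "$result"
```

Matching on the attribute name, as the other answers do, avoids this dependence on word counts.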

Claes Wikner

