I have hundreds of XML files containing text like the following:

<Init dflt_value='1.00' max_value='1000000.00' diff_ele='1.0' new='Yes' />

where the max_value attribute may have different values.

Issue: I need to replace the value of the max_value attribute with 100 (for example) in all files. I tried something like this:

grep -rl 'max_value' | xargs sed -i "s/max_value='.*'/max_value='25'/g"

But nothing is working for me. What might be the solution for it?

Ambuj

3 Answers

Don't parse XML/HTML with regex, use a proper XML/HTML parser and a powerful query.

Theory:

According to compiler theory, XML/HTML can't be parsed with regular expressions, which are based on finite state machines. Because of the hierarchical structure of XML/HTML, you need a pushdown automaton and a grammar-based parser, for example an LALR grammar processed by a tool like YACC.

realLife©®™ everyday tools in a shell:

You can use one of the following:

xmllint often installed by default with libxml2, xpath1 (check my wrapper to have newline-delimited output)

xmlstarlet can edit, select, transform... Not installed by default, xpath1

xpath installed via perl's module XML::XPath, xpath1

xidel xpath3

saxon-lint my own project, wrapper over @Michael Kay's Saxon-HE Java library, xpath3

or you can use high-level languages and proper libs, I think of:

Python's lxml (from lxml import etree)

Perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath

, check this example

PHP's DOMXpath, check this example


Check: Using regular expressions with HTML tags


Example using xmlstarlet:

xmlstarlet ed -u '//Init/@max_value' -v '100' *.xml

If you want to edit in place, use the -L switch:

xmlstarlet ed -L -u '//Init/@max_value' -v '100' *.xml

Example using Python & lxml to edit in place:

# edit in place XML
from lxml import etree
import sys
myXML = sys.argv[1]  # path of the XML file to edit

tree = etree.parse(myXML)
root = tree.getroot()
code = root.xpath("//Init")
for i in code:
    # only touch elements that actually carry the attribute
    if 'max_value' in i.attrib:
        i.attrib['max_value'] = '100'

# write the modified tree back over the original file
tree.write(myXML, pretty_print=True)
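
For the "hundreds of files" case from the question, the same idea could be looped over a directory tree. What follows is only a sketch under stated assumptions: it swaps lxml for Python's standard-library xml.etree.ElementTree so nothing needs installing, and the function names set_max_value and update_tree are hypothetical, not part of any library.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def set_max_value(path, new_value="100"):
    """Set max_value on every <Init> element of one file; return the number changed."""
    tree = ET.parse(str(path))
    changed = 0
    # iter("Init") visits every <Init> in the document, much like the XPath //Init
    for elem in tree.iter("Init"):
        if "max_value" in elem.attrib:
            elem.set("max_value", new_value)
            changed += 1
    if changed:
        # rewrite only the files that actually contained the attribute
        tree.write(str(path), encoding="utf-8", xml_declaration=True)
    return changed


def update_tree(root_dir, new_value="100"):
    """Apply set_max_value to every *.xml file below root_dir (hypothetical helper)."""
    results = {}
    for xml_file in sorted(Path(root_dir).rglob("*.xml")):
        results[str(xml_file)] = set_max_value(xml_file, new_value)
    return results
```

Calling update_tree(".") would process the current directory recursively. Note that ElementTree re-serializes each file it changes, so quoting and whitespace in the output may differ from the input, and comments are dropped by its default parser; try it on a copy first.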
Gilles Quénot

Your specific problem is that in sed, the .* is "greedy." That is, it matches as much as it possibly can which can cause it to merge two or more fields into one.

You want to be a little more careful about what you match. For replacing a number, try just matching numeric digits, maybe with a decimal point:

s/max_value='[0-9.]*'/max_value='25'/g

In general, what you want to do is use a negated character class that excludes the closing quote:

s/'[^']*'/ ...

But in this specific case, [0-9.] does the job and is slightly clearer. (You would not want to try to match every possible character in a sentence using a positive pattern this way; it is far better to use a negative pattern and just say "everything except the end quote, followed by the end quote".)
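
The difference between the greedy and the negated-class pattern can be seen on the sample line from the question. This is a small sketch using Python's re module rather than sed; for these two patterns the matching behaviour is the same, but the Python framing is my assumption, not something from the question.

```python
import re

line = "<Init dflt_value='1.00' max_value='1000000.00' diff_ele='1.0' new='Yes' />"

# Greedy: .* runs to the LAST quote on the line, swallowing the later attributes.
greedy = re.sub(r"max_value='.*'", "max_value='25'", line)

# Negated class: [^']* stops at the first closing quote after max_value='.
bounded = re.sub(r"max_value='[^']*'", "max_value='25'", line)

print(greedy)   # <Init dflt_value='1.00' max_value='25' />
print(bounded)  # <Init dflt_value='1.00' max_value='25' diff_ele='1.0' new='Yes' />
```

The greedy version silently deletes diff_ele and new, which is why the original command damages the line instead of just changing one value.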

aghast
  • `sed` is not a tool to recommend to people to parse xml nor html – Gilles Quénot Apr 02 '18 at 13:06
  • `sed` is also not a tool to parse English language text. But if I had a latex document where I needed to replace 'ran' with 'run,' I would reach for `sed`. It is also, of course, the tool the OP actually asked the question about. – aghast Apr 02 '18 at 13:14
  • `sed` is all about **TEXT** parsing. Maybe you upgrade your SQL database with `sed` too ? – Gilles Quénot Apr 02 '18 at 13:22
  • If one ignores that it is XML, this is just a simple find and replace operation that sed excels at. They aren't parsing it per se, but rather replacing one number with another. – gpojd Apr 02 '18 at 14:21

The problem is that .* also matches the ' character, so the match runs past the closing quote. Better use:

xargs sed "/max_value=/s/max_value='[^']*'/max_value='${new_value}'/g"

Note:

Beware that ' is a special character for the shell (so I used double quotes around the whole sed command).

Also take into account that the expression can appear in places other than the ones you are searching for. As XML is not a regular language, it is not a good idea to match it with regular expressions. Using a full XML parser would let you change occurrences on a per-attribute basis instead of by plain text search. And take into account that sed(1) without -i acts as a filter: you won't edit the files, you'll get the result on standard output.
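
To illustrate the "can appear in other places" point, here is a small sketch in Python comparing a plain-text substitution with a parser-based edit. The sample document, including the comment and the Limit element, is made up for illustration and is not from the question.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document: max_value also occurs in a comment and on another element.
doc = (
    "<cfg>"
    "<!-- max_value='1' was the old default -->"
    "<Init max_value='1000000.00' new='Yes' />"
    "<Limit max_value='5' />"
    "</cfg>"
)

# Plain-text substitution rewrites every textual match, wherever it occurs.
by_regex = re.sub(r"max_value='[^']*'", "max_value='100'", doc)
regex_hits = by_regex.count("max_value='100'")

# Parser-based edit touches only the attribute on <Init> elements.
root = ET.fromstring(doc)
for init in root.iter("Init"):
    init.set("max_value", "100")

print(regex_hits)                            # 3: comment, Init and Limit all changed
print(root.find("Limit").get("max_value"))   # 5: left alone by the parser-based edit
```

Whether the extra matches matter depends on the files; if max_value only ever appears on Init, the text substitution gives the same result.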

In case you want to edit the files, you can use ed(1) instead.

grep -rl max_value . |
while IFS= read -r file
do
    # quoting 'EOF' stops the shell from expanding $s inside the ed script
    ed -s "$file" <<'EOF'
1,$s/max_value='[^']*'/max_value='100'/g
w
q
EOF
done
Luis Colorado