sed regexp matching in a long line

Question

I have an XML file from which I want to extract all occurrences of some tag AB. The file is one long line of ~500,000 characters.

I know regular expressions reasonably well, but when I try to extract only the characters within the tags using sed, I am totally lost by the result :).

Here's my command:

sed -r 's/(.*)<my_tag>([A-Z][A-Z])<\/my_tag>(.*)/hello\2/g' myfile.out

This replaces the entire file with just a single "helloAB" (for example), while I expected the output to contain at least 100 matches.

I suspect this has something to do with greedy matching, but I'm not getting anywhere. Maybe awk is a better idea?

The `.*` bits are eating up everything. It would probably fix the issue to use a non-greedy version of both instances. — abiessu, Aug 29 '13 at 16:01
You would be best using a proper XML parsing utility for this, as XML is not a regular language, and so regular expressions are not the best tool for the job. You may be able to achieve some simple XML parsing with regexes, but, as you can see already for this simple case, the RE you need to use even here can get a bit tricky... — twalberg, Aug 29 '13 at 16:20
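The greedy behaviour described in the first comment can be reproduced on a small scale. The sketch below uses Python's `re` module as a stand-in for `sed -r` (an assumption made for illustration; the sample line is hypothetical). Note that sed's POSIX regex syntax has no lazy quantifier such as `.*?`, so the pattern there would have to be restructured rather than simply made non-greedy.

```python
import re

# Hypothetical one-line sample standing in for the real file.
line = "<r><my_tag>AB</my_tag><x/><my_tag>CD</my_tag><my_tag>EF</my_tag></r>"

# Same shape as the sed expression: a greedy .* on both sides of one tag.
greedy = re.compile(r"(.*)<my_tag>([A-Z][A-Z])</my_tag>(.*)")

# The first (.*) grabs as much as it can, so only the LAST tag is left for
# group 2, and the single match spans the whole line.
result, count = greedy.subn(r"hello\2", line)
print(result, count)
```

Because that one match already covers the entire line, the `g` flag has nothing left to act on, which is why only a single "hello.." comes out.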

score 1 · Answer 1 · answered Aug 29 '13 at 18:09

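If you have Python (2.6+), this should be fairly trivial:

import xml.dom.minidom as MD

# Parse the whole file into a DOM tree (requires well-formed XML)
tree = MD.parse("yourfile.xml")
# Print every element whose tag name is "AB"
for e in tree.getElementsByTagName("AB"):
    print(e.toprettyxml())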

In general, avoid parsing XML by hand, since there are much simpler solutions like this one. Libraries of this kind also give you easy access to attributes and values without further parsing.
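As a rough illustration of that last point, here is a minimal sketch that pulls out only the text inside each element rather than pretty-printing it. It assumes the tag is named `my_tag`, as in the question's sed command, and uses a small hypothetical in-memory sample instead of the real file.

```python
import xml.dom.minidom as MD

# Hypothetical well-formed sample; the real input would come from MD.parse(path).
sample = "<root><my_tag>AB</my_tag><other>x</other><my_tag>CD</my_tag></root>"
tree = MD.parseString(sample)

values = []
for e in tree.getElementsByTagName("my_tag"):
    child = e.firstChild
    # Keep only elements whose first child is a text node.
    if child is not None and child.nodeType == child.TEXT_NODE:
        values.append(child.data)

print(values)
```

This only works when the input is well-formed; the parser raises an exception otherwise.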

score 0 · Answer 2 · answered Aug 30 '13 at 06:51

Thank you for your answers.

I tried @MannyD's suggestion, but unfortunately the XML didn't seem to be well formed, so the parsing failed. Since I cannot count on receiving only well-formed XML, I wrote a grep solution, which does the job.

grep -o "<my_tag>[A-Z][A-Z]</my_tag>" myfile.out | sort -u

The -o flag prints each match on its own line; from there I sort the matches and keep only the unique ones.
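For comparison, roughly the same idea can be sketched in Python with `re.findall`, which also tolerates input that is not well-formed. The function name and sample string are hypothetical, and unlike the grep command this version returns only the two letters, not the surrounding tags.

```python
import re

def unique_tag_values(text):
    """Return the sorted, de-duplicated two-letter values inside <my_tag> elements."""
    # findall returns every non-overlapping match of the capture group,
    # much like grep -o prints every match; set() + sorted() mirrors sort -u.
    return sorted(set(re.findall(r"<my_tag>([A-Z][A-Z])</my_tag>", text)))

# Hypothetical sample with an unclosed <broken> element and a duplicate value.
sample = "<r><my_tag>CD</my_tag><broken><my_tag>AB</my_tag><my_tag>CD</my_tag>"
print(unique_tag_values(sample))
```

The pattern contains no `.*`, so each match is confined to one tag pair and greediness is not an issue.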
