Help with sed regex: extract text from specific tag

Question

First time sed'er, so be gentle.

I have the following text file, 'test_file':

 <Tag1>not </Tag1><Tag2>working</Tag2>

I want to extract the text between the <Tag2> tags using a sed regex. There may be other occurrences of <Tag2>, and I would like to extract those as well.

So far I have this sed-based command:

cat test_file | grep -i "Tag2"| sed 's/<[^>]*[>]//g'

which gives the output:

 not working

Does anyone have any idea how to get this working?

From what you have written, I am guessing you only need the text between the Tag2 tags. Is that correct? If that is the case, do you know what cat test_file | grep -i "Tag2" outputs? — , Jan 27 '10 at 18:32
See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — bmargulies, Jan 27 '10 at 18:50
Sorry to say this, but posting *the link* in reaction to a regex+(x)html related question, without providing any more information, is perhaps as tiring as the question itself. Come to think of it, it's even more so. It's the equivalent of posting the notorious quote "programmer *bla bla bla* problem *cough* regex *gulp* has two problems!". If you feel the uncontrollable urge to post the link, at least give the original poster a slight indication that what s/he is about to do is not the best solution. — Bart Kiers, Jan 27 '10 at 19:43

score 4 · Answer 1 · answered Jan 27 '10 at 19:03

As another poster said, sed may not be the best tool for this job. You may want to use something built for XML parsing, or even a simple scripting language such as Perl.

The problem with your attempt is that you aren't parsing the string properly.

cat test_file is good - it prints out the contents of the file to stdout.

grep -i "Tag2" is ok - it prints out only lines with "Tag2" in them. This may not be exactly what you want. Bear in mind that it will print the whole line, not just the <Tag2> part, so you will still have to search out that part later.

sed 's/<[^>]*[>]//g' isn't what you want - it simply removes the tags, including <Tag1> and <Tag2>.
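To make the three steps above concrete, here is a minimal sketch that replays the question's pipeline on the sample line. The function name strip_tags is only an illustrative helper, not something from the original post.

```shell
# Hypothetical helper: the tag-stripping step from the question, reading stdin.
strip_tags() {
  sed 's/<[^>]*>//g'
}

# The sample line from test_file.
line=' <Tag1>not </Tag1><Tag2>working</Tag2>'

# grep keeps the whole matching line, so the text inside <Tag1> is still
# present when sed deletes every tag.
printf '%s\n' "$line" | grep -i 'Tag2' | strip_tags
```

This reproduces the " not working" output from the question: nothing in the pipeline ever separates the <Tag2> text from the rest of the line.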

You can try something like:

cat test_file | grep -i tag2 | sed 's/.*<Tag2>\(.*\)<\/Tag2>.*/\1/'

This will produce

working

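but it will extract only one tag pair per line. Because the leading .* is greedy, a line containing several <Tag2> elements yields only the text of the last one.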
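If several <Tag2> elements can appear on one line, one possible workaround is to let grep split them out first. This is only a sketch under stated assumptions: it relies on grep -o (supported by GNU and BSD grep, but not required by POSIX), it assumes each element opens and closes on the same line, and it assumes the element contains no nested tags. The function name extract_tag2 is hypothetical.

```shell
# Hypothetical helper: print the text of every <Tag2> element, one per line.
# Reads from stdin.
extract_tag2() {
  # grep -o prints each match on its own line; [^<]* stops at the next tag,
  # so two elements on the same line are matched separately.
  grep -o '<Tag2>[^<]*</Tag2>' |
    # With only the <Tag2> elements left, stripping all tags is now safe.
    sed 's/<[^>]*>//g'
}

printf '%s\n' ' <Tag1>not </Tag1><Tag2>working</Tag2><Tag2>again</Tag2>' | extract_tag2
```

Like the sed command above, this is still regex scraping rather than XML parsing, so it carries the same caveats.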

+1 for **NOT** posting *the link* and patiently answering the question as well as warning that it is not a general solution to the problem. — Bart Kiers, Jan 27 '10 at 19:46

score 4 · Answer 2 · answered Jan 27 '10 at 19:34

For your nice, friendly example, you could use

sed -e 's/^.*<Tag2>//' -e 's!</Tag2>.*!!' test_file

but the XML out there is cruel and uncaring. You're asking for serious trouble using regular expressions to scrape XML.
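One thing to be aware of with the command above: a line that contains no <Tag2> at all matches neither substitution, so sed prints it unchanged. A possible variation, sketched here with a hypothetical helper name tag2_only, uses -n and an address so that only lines containing <Tag2> are printed.

```shell
# Hypothetical helper: print only the text inside <Tag2>, skipping lines
# that do not contain the tag. Reads from stdin.
tag2_only() {
  # -n suppresses default output; the block runs only on lines matching
  # <Tag2>, applies the same two substitutions, then prints the result.
  sed -n '/<Tag2>/{s/^.*<Tag2>//;s!</Tag2>.*!!;p;}'
}

printf '%s\n' 'no tags here' ' <Tag1>not </Tag1><Tag2>working</Tag2>' | tag2_only
```

It is still limited to one <Tag2> element per line and to elements that start and end on the same line.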

Greg Bacon

score 0 · Answer 3 · answered Jan 28 '10 at 00:21

You can use gawk, which accepts a multi-character record separator (RS), e.g.:

$ cat file
 <Tag1>not </Tag1><Tag2>working here</Tag2>
 <Tag1>not </Tag1><Tag2>
working

</Tag2>

$ awk -vRS="</Tag2>" '/<Tag2>/{gsub(/.*<Tag2>/,"");print}' file
working here

working

ghostdog74

score 0 · Answer 4 · answered Feb 08 '10 at 15:03 · edited Feb 08 '10 at 15:08

awk -F"Tag2" '{print $2}' test_1 | sed 's/[^a-zA-Z]//g'

Vijay
