First time sed'er, so be gentle.

I have the following text file, 'test_file':

 <Tag1>not </Tag1><Tag2>working</Tag2>

I want to extract the text between the <Tag2> tags using a sed regex. There may be other occurrences of <Tag2>, and I would like to extract those as well.

So far I have this sed based regex:

cat test_file | grep -i "Tag2"| sed 's/<[^>]*[>]//g'

which gives the output:

 not working

Anyone any idea how to get this working?

Greg Bacon
Mount Stuart
  • From what you have written, I am guessing you only need the text between the Tag2 tags. Is that correct? If that is the case, do you know what cat test_file | grep -i "Tag2" outputs? –  Jan 27 '10 at 18:32
  • See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – bmargulies Jan 27 '10 at 18:50
  • Sorry to say this, but posting *the link* in reaction to a regex+(x)html related question, without providing any more information, is perhaps as tiring as the question itself. Come to think of it, it's even more so. It's the equivalent of posting the notorious quote "programmer *bla bla bla* problem *cough* regex *gulp* has two problems!". If you feel the uncontrollable urge to post the link, at least give the original poster a slight indication that what s/he is about to do is not the best solution. – Bart Kiers Jan 27 '10 at 19:43

4 Answers

As another poster said, sed may not be the best tool for this job. You may want to use something built for XML parsing, or even a simple scripting language, such as perl.

The problem with your attempt is that you aren't analyzing the string properly.

cat test_file is good - it prints out the contents of the file to stdout.

grep -i "Tag2" is ok - it prints out only lines with "Tag2" in them. This may not be exactly what you want. Bear in mind that it will print the whole line, not just the <Tag2> part, so you will still have to search out that part later.

sed 's/<[^>]*[>]//g' isn't what you want - it simply removes the tags, including <Tag1> and <Tag2>.
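To see the effect in isolation, the tag-stripping expression can be run on the sample line directly. This is only a small sketch: it feeds the line through printf instead of reading test_file, and the variable names are made up for illustration.

```shell
# Sample line from the question, supplied via printf rather than test_file.
line=' <Tag1>not </Tag1><Tag2>working</Tag2>'

# The expression deletes every <...> sequence, whatever the tag name,
# so the text of both elements is left behind and runs together.
stripped=$(printf '%s\n' "$line" | sed 's/<[^>]*[>]//g')

# Brackets make the leading space visible.
printf '[%s]\n' "$stripped"   # prints "[ not working]"
```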

You can try something like:

cat test_file | grep -i tag2 | sed 's/.*<Tag2>\(.*\)<\/Tag2>.*/\1/'

This will produce

working

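but it will only work for one tag pair per line: the leading .* is greedy, so if a line holds several <Tag2> pairs, only the last one is kept.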
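If several <Tag2> pairs can share a line, one possible workaround is to isolate each pair first and strip the tags afterwards. Treat this as a sketch under assumptions, not a general solution: it assumes a grep that supports -o (GNU and BSD grep do, but it is not in POSIX), that the element text contains no < character or nested tags, and that each pair opens and closes on the same line. Unlike grep -i, it is case-sensitive. The sample input is invented for illustration.

```shell
# Hypothetical input: two <Tag2> pairs on the first line, one on the second.
input='<Tag1>not </Tag1><Tag2>working</Tag2><Tag2>here</Tag2>
<Tag2>either</Tag2>'

# grep -o prints each matching <Tag2>...</Tag2> on its own line;
# sed then removes the tags and leaves only the text.
result=$(printf '%s\n' "$input" | grep -o '<Tag2>[^<]*</Tag2>' | sed 's/<[^>]*>//g')

printf '%s\n' "$result"
```

With this input the three values come out one per line. A Tag2 element that spans lines is silently skipped.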

Avi
  • +1 for **NOT** posting *the link* and patiently answering the question as well as warning that it is not a general solution to the problem. – Bart Kiers Jan 27 '10 at 19:46

For your nice, friendly example, you could use

sed -e 's/^.*<Tag2>//' -e 's!</Tag2>.*!!' test_file

but the XML out there is cruel and uncaring. You're asking for serious trouble using regular expressions to scrape XML.
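To make that warning concrete, here is a sketch of what the two-expression command does with input that is only slightly less friendly. The input lines are invented for illustration and are fed through printf rather than a file.

```shell
# Hypothetical input: a line with two <Tag2> pairs, then a line with none.
input='<Tag2>first</Tag2><Tag2>second</Tag2>
<Tag1>no Tag2 here</Tag1>'

out=$(printf '%s\n' "$input" | sed -e 's/^.*<Tag2>//' -e 's!</Tag2>.*!!')

# The greedy ^.* swallows everything up to the last <Tag2>, so "first" is lost,
# and the line without <Tag2> passes through unchanged, tags and all.
printf '%s\n' "$out"
```

Filtering with grep first, as in the question, would avoid the second problem but not the first.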

Greg Bacon
  • +1 for **NOT** posting *the link* and patiently answering the question as well as warning that it is not a general solution to the problem. – Bart Kiers Jan 27 '10 at 19:46

You can use gawk, e.g.:

$ cat file
 <Tag1>not </Tag1><Tag2>working here</Tag2>
 <Tag1>not </Tag1><Tag2>
working

</Tag2>

$ awk -vRS="</Tag2>" '/<Tag2>/{gsub(/.*<Tag2>/,"");print}' file
working here

working
ghostdog74
awk -F"Tag2" '{print $2}' test_1 | sed 's/[^a-zA-Z]//g'
Vijay