Use regex to retrieve string between characters

Question

I would like to either use a grep command or just know the regex to get the following string between the ">" and "<" characters.

String:

<f id=mos-title>demo-break-1</f>

I would like to return

demo-break-1

[Here's the regex you need.](http://en.wikipedia.org/wiki/XPath) — , Mar 14 '13 at 22:10
Another way to do it: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Gilles Quénot, Mar 14 '13 at 22:13

score 0 · Answer 1 · answered Mar 15 '13 at 00:08

Suppose the file foo contains:

<f id=mos-title>demo-break-1</f>
<f id=mos-title>demo-break-2</f>
<f id=mos-title>demo-break-3</f>
<a>foo testing</a>

You could do something like this:

perl -ne 'print "$1\n" if /<.+id=mos-title>(.+?)<\/f>/' foo

Keep in mind that this only matches when the opening tag, the text, and the closing tag are all on one line. You will also have to account for any deviations in the format yourself, since a regex is not an HTML parser.

Here's a more lenient pattern that tolerates extra attributes after the id and whitespace around the text, but it is still not 100% HTML compliant.

perl -ne 'print "$1\n" if /<.+id=mos-title\b.*?>\s*(.+?)\s*<\/f>/' foo

Output would be as follows:

demo-break-1
demo-break-2
demo-break-3
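
Since the question asks about grep, the same single-line match can be sketched without perl. This is only an illustration under the same assumptions as above (one element per line, unquoted id exactly as shown); the function name extract_titles is made up for the example, and grep -o is a common extension (GNU, BSD, BusyBox) rather than strict POSIX.

```shell
#!/bin/sh
# extract_titles: hypothetical helper name, not from the thread.
# Reads markup on stdin and prints the text of each <f id=mos-title> element.
extract_titles() {
  # grep -o prints only the matched part: the opening tag plus the text up to the next "<".
  # cut then drops the opening tag by keeping everything after the first ">".
  grep -o '<f id=mos-title>[^<]*' | cut -d'>' -f2-
}

# Sample input copied from the answer above.
printf '%s\n' \
  '<f id=mos-title>demo-break-1</f>' \
  '<f id=mos-title>demo-break-2</f>' \
  '<f id=mos-title>demo-break-3</f>' \
  '<a>foo testing</a>' | extract_titles
```

The `<a>` line is skipped because the pattern names the opening tag explicitly. Like the perl version, this does not handle text that spans several lines.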

score 0 · Answer 2 · answered Mar 15 '13 at 00:36

If you have a proper XML document like this:

<root>
  <f id="mos-title">demo-break-1</f>
</root>

you can use a proper parser:

xmllint --xpath "/root/f[@id='mos-title']" input.xml | \
      sed 's/[^>]*>\([^<]*\)<[^>]*>/\1\n/g'

With the input you have, if you are sure that the format is consistent (e.g., machine-generated), you can use sed:

sed 's/[^>]*>\([^<]*\)<[^>]*>/\1/g' input
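
One caveat with the sed command above: it strips whatever tag it finds, so a line such as `<a>foo testing</a>` from the earlier sample file would come out as `foo testing`. If only the mos-title elements are wanted, a stricter variant could look like the sketch below. The function name extract_f_text is hypothetical, and the pattern assumes the unquoted `id=mos-title` form from the question.

```shell
#!/bin/sh
# extract_f_text: hypothetical helper name, not from the thread.
# Reads markup on stdin and prints only the text inside <f id=mos-title>...</f>.
extract_f_text() {
  # -n suppresses sed's default printing; the trailing "p" prints a line
  # only when the substitution matched, so unrelated lines are dropped.
  sed -n 's/.*<f id=mos-title>\([^<]*\)<\/f>.*/\1/p'
}

# Sample input: one matching element and one unrelated element.
printf '%s\n' \
  '<f id=mos-title>demo-break-1</f>' \
  '<a>foo testing</a>' | extract_f_text
```

This still works line by line, so it shares the single-line limitation of the regex approaches.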

score 0 · Answer 3 · Scrutinizer · 2013-03-16T12:01:46.577

It is usually best to use an XML parser, but you could try this awk:

awk '$1==s{print $2}' s="f id=mos-title" RS=\< FS=\> file
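
The one-liner is terse, so here is the same idea spelled out as a sketch. Setting RS to `<` makes every record start right after a `<`, and setting FS to `>` splits that record into the tag (field 1) and the text that follows it (field 2). The function name extract_with_awk and the variable name want are illustrative, not part of the original command.

```shell
#!/bin/sh
# extract_with_awk: hypothetical helper name, not from the thread.
# Reads markup on stdin and prints the text following each <f id=mos-title> tag.
extract_with_awk() {
  awk -v want='f id=mos-title' '
    # Records are split on "<", fields on ">".
    # For "<f id=mos-title>demo-break-1</f>" the records are:
    #   ""                              (text before the first "<")
    #   "f id=mos-title>demo-break-1"   ($1 is the tag, $2 is the text)
    #   "/f>" plus the newline          ($1 is "/f", so it is ignored)
    BEGIN { RS = "<"; FS = ">" }
    $1 == want { print $2 }
  '
}

# Sample input copied from the first answer.
printf '%s\n' \
  '<f id=mos-title>demo-break-1</f>' \
  '<f id=mos-title>demo-break-2</f>' \
  '<a>foo testing</a>' | extract_with_awk
```

Because records are delimited by `<` rather than by newlines, the text may span several lines. The comparison on field 1 is exact, so extra attributes or different spacing inside the tag would not match.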

edited Mar 16 '13 at 12:01

answered Mar 16 '13 at 11:56
