grep only area in between two strings

Question

I'm trying to grep something out of an HTML page (specifically a user name), and I try to retrieve the string with:

egrep -o dir\=\"[ltr]*\"\>.*(\<\/span|\<\/a)

By this I am trying to say: get anything after dir="ltr"> or dir="rtl"> and before the first </a> or </span> closing tag.

So, for example:

dir="ltr">myusername</span>

or

dir="rtl">myusername</a>

There are, however, multiple span tags on one line, and the match does not stop after the first one, which results in data that I don't want.

Is there a way to modify my current regex to stop after the first one? And why does it even continue reading?

Thanks

see also http://stackoverflow.com/questions/22221277/bash-grep-between-two-lines-with-specified-string — cp.engr, Feb 03 '16 at 22:44

score 2 · Accepted Answer · edited Oct 03 '12 at 02:20


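The .* is greedy: it matches as much as it can, so it runs to the last </span or </a on the line instead of the first. You need to make the .* non-greedy by adding a ? to it. Non-greedy quantifiers are a Perl-style extension that POSIX extended regex (egrep) does not support, so with GNU grep use -P:

grep -oP 'dir="[ltr]*">.*?(</span|</a)'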
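To see the effect on a line with more than one closing tag, here is a small sketch. The sample line is made up for illustration. It does not rely on non-greedy support at all: it uses a negated character class, `[^<]*`, which cannot cross a `<` and so should work with any `grep -E` that supports `-o`.

```shell
# Hypothetical sample line with two spans, made up for illustration.
line='<span dir="ltr">myusername</span> <span class="x">other</span>'

# Greedy: .* runs to the LAST closing tag on the line.
greedy=$(printf '%s\n' "$line" | grep -oE 'dir="(ltr|rtl)">.*(</span|</a)')

# Negated class: [^<]* cannot cross a '<', so the match ends at the first tag.
first=$(printf '%s\n' "$line" | grep -oE 'dir="(ltr|rtl)">[^<]*(</span|</a)')

printf '%s\n' "$greedy"
printf '%s\n' "$first"
```

The negated class only helps when the text you want contains no `<` itself, which is usually true for a user name.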

A better solution is this (in raw regex, you will need to escape it):

dir="[ltr]{3}"[^>]*?>(.*?)(</span>|</a>)

Capture group 1 ($1) will contain the text between the tags, and capture group 2 ($2) will tell you whether the closing tag was a span or a link.

edited Oct 03 '12 at 02:20

Matt Ball

answered Oct 03 '12 at 02:14

tweak2

score 0 · Answer 2 · answered Oct 03 '12 at 03:55

I would use GNU sed to do this:

sed -r 's/(dir="ltr"|dir="rtl")>([^<]+)(<\/span>|<\/a>).*/\2/' file.txt

You can make the regex a bit more clever and easier to read with some simplification:

sed -r 's/dir="(ltr|rtl)">([^<]+)<\/(span|a)>.*/\2/' file.txt
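Note that a plain `s///` leaves any text before `dir=` on the line and passes non-matching lines through unchanged. If that is a problem, a possible variant (a sketch, with made-up input, using `-E`, which GNU sed accepts as a synonym for `-r`) adds a leading `.*` and prints only the lines where the substitution succeeded:

```shell
# Hypothetical input, made up for illustration: one matching line, one not.
input='<span dir="ltr">myusername</span> <span>other</span>
<p>no user here</p>'

# -n suppresses default output; the p flag prints only lines that were substituted.
# The leading .* discards whatever precedes dir= on the line.
names=$(printf '%s\n' "$input" |
  sed -nE 's/.*dir="(ltr|rtl)">([^<]+)<\/(span|a)>.*/\2/p')

printf '%s\n' "$names"
```

One caveat: the leading `.*` is greedy, so if a line contains several `dir=` attributes this variant should pick up the last one rather than the first.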
