I'm trying to grep something out of an HTML page (specifically a user name). I try to retrieve the string with:

egrep -o dir\=\"[ltr]*\"\>.*(\<\/span|\<\/a)

By this I am trying to say: "get anything after dir="ltr"> or dir="rtl"> and before the first </a> or </span> closing tag".

For example:

dir="ltr">myusername</span>

or

dir="rtl">myusername</a>

However, there are multiple span tags on one line, and the match does not stop after the first one, so I end up with data that I don't want.

Is there a way to modify my current regex to stop after the first one? And why does it even continue reading?

Thanks

Sam
  • see also http://stackoverflow.com/questions/22221277/bash-grep-between-two-lines-with-specified-string – cp.engr Feb 03 '16 at 22:44

2 Answers


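You need to make the .* non-greedy by adding a ? to it. Lazy quantifiers are a Perl-style extension that POSIX extended regexes (egrep) do not support, so this needs a grep with PCRE support, e.g. GNU grep with -P. Quoting the pattern also saves you the shell escaping:

grep -oP 'dir="[ltr]*">.*?(</span|</a)'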
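As for why the original keeps reading: .* is greedy, so it runs to the end of the line and only backs up as far as the last closing tag. If -P is not available, a negated character class gives the same "stop at the first tag" behaviour in plain extended regexes. This is a minimal sketch, assuming GNU or BSD grep (for -o and -E); the sample line is made up to match the description in the question.

```shell
# Hypothetical sample line with two spans, as described in the question.
line='<span dir="ltr">myusername</span> <span class="x">other</span>'

# Greedy: .* consumes as much as possible, so the match ends at the LAST "</span".
greedy=$(printf '%s\n' "$line" | grep -oE 'dir="(ltr|rtl)">.*(</span|</a)')

# Bounded: [^<]* cannot cross a '<', so the match ends at the FIRST closing tag.
bounded=$(printf '%s\n' "$line" | grep -oE 'dir="(ltr|rtl)">[^<]*(</span|</a)')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

The bounded form assumes the user name itself never contains a '<', which should hold for escaped HTML text.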

A better solution is this (in raw regex, you will need to escape it):

dir="[ltr]{3}"[^>]*?>(.*?)(</span>|</a>)

Capture group 1 ($1) will contain the text between the tags, and capture group 2 ($2) will tell you whether it was terminated by a span or a link closing tag.

See it in action: http://regexr.com?32b8k
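To try the two capture groups from the command line, here is a rough sketch using sed -E, which has back-references but no lazy quantifiers, so (.*?) is swapped for ([^<]*). The sample line and the variable names are assumptions for illustration only.

```shell
# Hypothetical sample line: extra attributes after dir= are allowed by [^>]*.
line='<a dir="rtl" class="user">myusername</a> <span>other</span>'

# \1 = text between the tags, \2 = the closing tag that ended it.
# -n plus the trailing p prints only lines where the substitution matched.
parsed=$(printf '%s\n' "$line" |
  sed -nE 's/.*dir="[ltr]{3}"[^>]*>([^<]*)(<\/span>|<\/a>).*/\1 \2/p')

printf '%s\n' "$parsed"
```

Because the leading .* is greedy, this picks the last dir="..." on a line if there are several.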

Matt Ball
tweak2

I would use GNU sed to do this:

sed -r 's/(dir="ltr"|dir="rtl")>([^<]+)(<\/span>|<\/a>).*/\2/' file.txt

You can make the regex a bit more clever and easier to read with some simplification:

sed -r 's/dir="(ltr|rtl)">([^<]+)<\/(span|a)>.*/\2/' file.txt
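A quick way to check this against a whole file is sketched below. It is a variation rather than the exact command above: it assumes you only want the user names, so it adds -n with the p flag to suppress non-matching lines and a leading .* to drop whatever precedes dir= on the line. It also uses -E, which both GNU and BSD sed accept in place of -r. The file contents are invented test data.

```shell
# Build a throwaway input file with made-up HTML lines.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<div class="header">no user here</div>
<span dir="ltr">alice</span> <span class="badge">admin</span>
<a dir="rtl">bob</a>
EOF

# Print only the captured user name from lines that match.
names=$(sed -nE 's/.*dir="(ltr|rtl)">([^<]+)<\/(span|a)>.*/\2/p' "$tmp")
rm -f "$tmp"

printf '%s\n' "$names"
```

Without -n and p, sed would also echo the first line unchanged, since a failed substitution leaves the line as it is.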
Steve