Simple Grep Mismatch problem

Question

I am on Ubuntu 10.10 and using grep to process some HTML files.

Here is the HTML snippet:

<a href="video.php?video=one-hd.mov"><img src="/1.jpg"><a href="video.php?video=normal.mov"><img src="/2.jpg"><a href="video.php?video=another-hd.mov">

I would like to extract one-hd.mov and another-hd.mov but ignore normal.mov.

Here is my code:

example='<a href="video.php?video=one-hd.mov"><img src="/1.jpg"><a href="video.php?video=normal.mov"><img src="/2.jpg"><a href="video.php?video=another-hd.mov">'
echo $example | grep -Po '(?<=video.php\?video=).*?(?=-hd.mov">)'

The result is:

one
normal.mov"><img src="/2.jpg"><a href="video.php?video=another

But I want

one
another

The second match is wrong: it runs from normal.mov all the way to another.

Is this because of so-called greedy regular expression matching?

I am using grep, but any command-line tool available from bash (sed, etc.) is welcome to solve this problem.

Thanks a lot.

As per the usual answer, do not EVER try to match HTML with a regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Marc B, Jul 10 '11 at 20:59
@Marc B Then what should I use to process HTML? Actually, grep is great for HTML processing. — DocWiki, Jul 10 '11 at 21:14
Grep is great for FINDING things. But the only thing that can handle HTML properly is a DOM parser. Anything else will just turn around to bite you at some point. — Marc B, Jul 11 '11 at 03:36

score 2 · Accepted Answer · answered Jul 10 '11 at 21:06

You want to use Perl regexes in grep, so why not use Perl directly?

echo "$example" | perl -nle 'm/.*?video.php\?video=([^"]+)">.*video.php\?video=([^"]+)".*/; print "=$1=$2="'

will print

=one-hd.mov=another-hd.mov=

clt60

I know. The secret is `[^"]`. The following works too: `echo $example | grep -Po '(?<=video.php\?video=)([^"]+)(?=-hd.mov">)'` – DocWiki Jul 10 '11 at 21:14
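
To illustrate why `[^"]` makes the difference, here is a minimal Python sketch (not from the thread; Python's `re` module is swapped in for `grep -P`, and the dots are escaped, which the original patterns do not do). A lazy `.*?` still starts at the first position where the lookbehind succeeds, so once it starts after `video=` in the normal.mov link it simply keeps expanding across quotes and tags until it reaches the next `-hd.mov">`. `[^"]+` cannot leave the attribute value, so that link produces no match at all.

```python
import re

example = ('<a href="video.php?video=one-hd.mov"><img src="/1.jpg">'
           '<a href="video.php?video=normal.mov"><img src="/2.jpg">'
           '<a href="video.php?video=another-hd.mov">')

# Lazy ".*?" may cross quotes and tags until the lookahead finally succeeds.
lazy = re.findall(r'(?<=video\.php\?video=).*?(?=-hd\.mov">)', example)

# [^"]+ stops at the closing quote, so the match stays inside one href value.
bounded = re.findall(r'(?<=video\.php\?video=)[^"]+(?=-hd\.mov">)', example)

print(lazy)     # second element spans from normal.mov to another
print(bounded)  # only the two -hd names
```

So the problem is not greediness as such: both patterns are anchored by the same lookbehind, and only the character class keeps the match from running into the next link.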

score 1 · Answer 2 · answered Jul 10 '11 at 21:10

Here is a solution using xmlstarlet:

$ example='<a href="video.php?video=one-hd.mov"><img src="/1.jpg"><a href="video.php?video=normal.mov"><img src="/2.jpg"><a href="video.php?video=another-hd.mov">'
$ echo $example | xmlstarlet fo -R 2>/dev/null | xmlstarlet sel -t -m "//*[substring(@href, string-length(@href) - 6, 7) = '-hd.mov']" -v 'substring(@href,17, string-length(@href) - 17 - 3)' -n
one-hd
another-hd

$

yankee

You even have to count the characters, as in `substring(@href,17, string-length(@href) - 17 - 3)`? It is not as convenient as grep. – DocWiki Jul 10 '11 at 21:35
@DocWiki: But it is more fail-safe. You can of course replace the numbers with string-length() functions and static values, or check which other string functions xmlstarlet supports. This is just an example of how to work with it. You can also combine xmlstarlet and grep: use xmlstarlet just to extract the href attributes, then use grep to get the filenames you want, if you feel more comfortable with that. – yankee Jul 10 '11 at 21:41
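
The idea of replacing the hard-coded 17 and 3 with computed lengths can be sketched like this. It is written in Python rather than XPath purely for illustration; `PREFIX`, `SUFFIX` and `hd_name` are hypothetical names, not part of xmlstarlet or of the answer above.

```python
PREFIX = "video.php?video="   # static text in front of the file name
SUFFIX = ".mov"               # extension to strip from the result


def hd_name(href):
    """Return the file name without prefix and extension, or None if it is not an -hd video."""
    if not (href.startswith(PREFIX) and href.endswith("-hd" + SUFFIX)):
        return None
    # Offsets come from the lengths of the static strings, not from counting by hand.
    return href[len(PREFIX):len(href) - len(SUFFIX)]


print(hd_name("video.php?video=one-hd.mov"))
print(hd_name("video.php?video=normal.mov"))
```

If the prefix or the extension ever changes, only the two constants need updating.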

score 1 · Answer 3 · answered Jul 10 '11 at 21:18

Solution using awk:

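# h.awk: run with -F'"' so that each quoted attribute value becomes its own field
{
    for (i = 1; i <= NF; i++) {
        if ($i ~ /mov/) {
            if ($i !~ /normal/) {
                sub(/^.*=/, "", $i)   # drop everything up to the last "="
                print $i
            }
        }
    }
}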

outputs:

$ awk -F'"' -f h.awk html
one-hd.mov
another-hd.mov

But I strongly advise you to use an HTML parser for this instead, such as BeautifulSoup.
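
As a rough idea of what the parser route could look like, here is a minimal sketch. It uses Python's standard-library `html.parser` instead of BeautifulSoup so that it runs without extra packages; the class name `HdVideoCollector` is made up for this example, and it assumes the file name always sits in a `video` query parameter, as in the snippet from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs


class HdVideoCollector(HTMLParser):
    """Collect the 'video' query parameter of every <a href> that ends in -hd.mov."""

    def __init__(self):
        super().__init__()
        self.videos = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Let the URL parser find the query string instead of counting characters.
        for name in parse_qs(urlsplit(href).query).get("video", []):
            if name.endswith("-hd.mov"):
                self.videos.append(name)


example = ('<a href="video.php?video=one-hd.mov"><img src="/1.jpg">'
           '<a href="video.php?video=normal.mov"><img src="/2.jpg">'
           '<a href="video.php?video=another-hd.mov">')

collector = HdVideoCollector()
collector.feed(example)
collector.close()
print(collector.videos)
```

Unlike the regex and awk approaches, this should keep working if the attribute order, the quoting style, or the whitespace inside the tags changes.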
