> <img alt="Citizen Kane Poster" title="Citizen Kane Poster"
src="https://images-na.ssl-images-amazon.com/images/M/MV5BMTQ2Mjc1MDQwMl5BMl5BanBnXkFtZTcwNzUyOTUyMg@@._V1_UX182_CR0,0,182,268_AL_.jpg"
itemprop="image" />

I want to extract the URL of the poster from the above text. This is my grep statement:

count=$(grep -zPo '(?<=> <img alt=").*?src="\K.*?(?="itemprop="image")'  ~/movie_local)

movie_local is where I saved the page source of the site. I am learning grep and don't have a complete command of it yet, so please go easy on me. Could you please help me out? :)

Swastik Udupa
  • 3
    First off, [don't use regex to parse HTML](http://stackoverflow.com/a/1732454/4060711). What you're looking for though, is a [capturing group](http://stackoverflow.com/questions/1891797/capturing-groups-from-a-grep-regex). – 3ocene Oct 23 '16 at 05:48
  • I will look into it, thank you :) But is there no way I can do this using grep? Is it not possible even with the extended Perl regex? I have done this on multiple occasions, but for this chunk of code it isn't working – Swastik Udupa Oct 23 '16 at 05:50
  • 1
    A capturing group is a way to extract part of a match. I'm not familiar with how to use them in grep as command line is a little weird. That second link seems to explain it quite well though. – 3ocene Oct 23 '16 at 05:52
  • 2
    I suggest using an XML/HTML parser (xmllint, xmlstarlet ...). – Cyrus Oct 23 '16 at 06:07
  • 1
    Try `(?s)(?<=>\s*).*?src= – Wiktor Stribiżew Oct 23 '16 at 08:01
  • Can you explain the terms? I am really sorry, I am a newbie. Why do we use ?s and \s*? – Swastik Udupa Oct 23 '16 at 08:04
  • 1
    `(?s)` is the ["dotall" option](http://www.pcre.org/original/doc/html/pcresyntax.html#SEC16) that makes `.` match `\n` too (which is the main problem with your regex); @WiktorStribiżew's comment provided the crucial pointer, but his regex uses a _variable-length_ lookbehind assertion, which isn't supported (at least as of GNU `grep` v2.26) - however, you don't need one at all, since you're already using `\K`. Wiktor added the `\s*` simply to make whitespace matching more flexible. – mklement0 Oct 23 '16 at 16:32
  • 2
    @3ocene: `grep` implementations (at least the ones I'm aware of) do not support capture groups: while you can use `(...)` for _grouping_, you won't be able to access what those groups _captured_. However, combining the `-o` option (return only the (entire) matched part) with lookaround assertions (requires GNU `grep` with `-P`) is a (limited) substitute for true capture-group support. It's exactly what this question's command attempts, though it fails for different reasons. – mklement0 Oct 23 '16 at 16:38
  • 2
    Actually, I added the `\s*` "manually" after testing. It won't work with the `*` quantifier. Anyway, yes, it is just a hint and not an attempt to answer. – Wiktor Stribiżew Oct 23 '16 at 17:37

1 Answer

(As has been said many times before, the best solution is to use an HTML parser.)
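
For illustration, here is a minimal sketch of the parser route using xmllint, which was suggested in the comments. It assumes xmllint (from libxml2) is installed; the temporary file and the shortened example.com URL are hypothetical stand-ins for ~/movie_local and the real poster URL.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for ~/movie_local, with a shortened placeholder URL.
page=$(mktemp)
cat > "$page" <<'EOF'
<html><body>
<img alt="Citizen Kane Poster" title="Citizen Kane Poster"
src="https://example.com/poster.jpg"
itemprop="image" />
</body></html>
EOF

poster_url=""
if command -v xmllint >/dev/null 2>&1; then
  # --html selects libxml2's lenient HTML parser; the XPath expression
  # yields the src attribute of the first <img> with itemprop="image".
  poster_url=$(xmllint --html \
    --xpath 'string(//img[@itemprop="image"]/@src)' "$page" 2>/dev/null)
  printf '%s\n' "$poster_url"
else
  echo "xmllint not found; skipping" >&2
fi
rm -f "$page"
```

Because the parser works on the element structure, line breaks and attribute order inside the tag should not matter here.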

With GNU grep, try this simplified version:

grep -zPo '<img alt=[^/]+?src="\K[^"]+' ~/movie_local

A fixed version of your original attempt (note the (?s) prefix; see below for an explanation):

grep -zPo '(?s)> <img alt=".*?src="\K.*?(?=")' ~/movie_local

Alternative, with [\s\S] used ad-hoc to match any char., including \n:

grep -zPo '> <img alt="[\s\S]*?src="\K.*?(?=")' ~/movie_local
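
To try the three commands above without the real page, a small sketch like the following may help. It assumes GNU grep built with -P support; the temporary file and the shortened example.com URL are hypothetical stand-ins for ~/movie_local and the real poster URL. Since -z also makes grep terminate each output record with a NUL byte instead of a newline, the sketch strips it with tr before storing the result in a variable.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for ~/movie_local, with a shortened placeholder URL.
sample=$(mktemp)
cat > "$sample" <<'EOF'
> <img alt="Citizen Kane Poster" title="Citizen Kane Poster"
src="https://example.com/poster.jpg"
itemprop="image" />
EOF

# -z makes grep end each output record with NUL rather than \n,
# so remove the NUL before capturing the match in a variable.
url_simple=$(grep -zPo '<img alt=[^/]+?src="\K[^"]+' "$sample" | tr -d '\0')
url_dotall=$(grep -zPo '(?s)> <img alt=".*?src="\K.*?(?=")' "$sample" | tr -d '\0')
url_adhoc=$(grep -zPo '> <img alt="[\s\S]*?src="\K.*?(?=")' "$sample" | tr -d '\0')

printf '%s\n' "$url_simple" "$url_dotall" "$url_adhoc"
rm -f "$sample"
```

On this sample, all three variables should end up holding the same URL.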

As for why your attempt didn't work:

  • When you use -P (for PCRE (Perl-Compatible Regular Expressions) support), . does not match \n chars. by default, so even though you're using -z to read the entire input at once, .* won't match across line boundaries. You have two choices:

    • Set option s ("dotall") at the start of the regex - (?s) - this makes . match any character, including \n
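    • Ad-hoc workaround: use [\s\S] instead of .
  • Additionally, your lookahead (?="itemprop="image") expects itemprop to follow the closing " directly, whereas in your input a newline separates the two; the commands above therefore use (?=") instead.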
  • As an aside: the \K construct is a syntactically simpler and sometimes more flexible alternative to a lookbehind assertion ((?<=...)).

    • Your command had both, which did no harm in this case, but was unnecessary.
    • By contrast, had you tried (?<=>\s*<img alt=") for more flexible whitespace matching - note the \s* in place of the original single space - your lookbehind assertion would have failed, because lookbehind assertions must be of fixed length (at least as of GNU grep v2.26).
      However, using just \K would have worked: >\s*<img alt="\K
      \K simply removes everything matched so far (doesn't include it in the output).
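
The points in this list can be checked with a short sketch such as the one below. It assumes GNU grep built with -P support; the temporary file and the shortened example.com URL are hypothetical stand-ins for ~/movie_local and the real poster URL. The lookbehind case is only tested for a nonzero exit status, since the exact error message depends on the PCRE version grep is linked against.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for ~/movie_local, with a shortened placeholder URL.
sample=$(mktemp)
cat > "$sample" <<'EOF'
> <img alt="Citizen Kane Poster" title="Citizen Kane Poster"
src="https://example.com/poster.jpg"
itemprop="image" />
EOF

# 1. Without (?s), . cannot cross the newline before src=, so no match.
no_dotall=$(grep -zPo '> <img alt=".*?src="\K[^"]+' "$sample" | tr -d '\0') || true

# 2. With (?s), . also matches \n and the URL is found.
with_dotall=$(grep -zPo '(?s)> <img alt=".*?src="\K[^"]+' "$sample" | tr -d '\0')

# 3. \K accepts a variable-length prefix such as \s*.
with_k=$(grep -zPo '(?s)>\s*<img alt=".*?src="\K[^"]+' "$sample" | tr -d '\0')

# 4. The same prefix inside a lookbehind is not fixed-length, so grep is
#    expected to reject the pattern; only the exit status is recorded.
grep -zPo '(?s)(?<=>\s*<img alt=").*?src="\K[^"]+' "$sample" >/dev/null 2>&1
lookbehind_status=$?

printf 'no_dotall=[%s]\nwith_dotall=[%s]\nwith_k=[%s]\nlookbehind_status=%s\n' \
  "$no_dotall" "$with_dotall" "$with_k" "$lookbehind_status"
rm -f "$sample"
```
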
mklement0