(As has been said many times before, the best solution is to use an HTML parser.)
With GNU grep, try this simplified version:
grep -zPo '<img alt=[^/]+?src="\K[^"]+' ~/movie_local
A fixed version of your original attempt (note the (?s) prefix; see below for an explanation):
grep -zPo '(?s)> <img alt=".*?src="\K.*?(?=")' ~/movie_local
Alternatively, use [\s\S] ad hoc to match any character, including \n:
grep -zPo '> <img alt="[\s\S]*?src="\K.*?(?=")' ~/movie_local
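To try the three commands without touching your real file, here is a small sketch that runs them against a made-up sample. The sample markup and the example.com URL are assumptions, since the actual contents of ~/movie_local aren't shown; GNU grep with -P support is assumed as well.

```shell
# Hypothetical two-line sample; the real ~/movie_local is not shown in the question.
sample=$(mktemp)
printf '%s\n' '<a href="#"> <img alt="Some Movie"' 'src="http://example.com/poster.jpg">' > "$sample"

# With -z, grep terminates each output record with NUL instead of a newline,
# so tr converts it back for display.
simplified=$(grep -zPo '<img alt=[^/]+?src="\K[^"]+' "$sample" | tr '\0' '\n')
dotall=$(grep -zPo '(?s)> <img alt=".*?src="\K.*?(?=")' "$sample" | tr '\0' '\n')
adhoc=$(grep -zPo '> <img alt="[\s\S]*?src="\K.*?(?=")' "$sample" | tr '\0' '\n')

# Print the result of each variant, one per line.
printf '%s\n' "$simplified" "$dotall" "$adhoc"
rm -f "$sample"
```

On this sample, all three variants should extract the same URL; the tr step is only needed because -z changes the output terminator too.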
As for why your attempt didn't work:
When you use -P (for PCRE, i.e. Perl-Compatible Regular Expression, support), . does not match \n chars. by default, so even though you're using -z to read the entire input at once, .* won't match across line boundaries. You have two choices:
- Set option s ("dotall") at the start of the regex with (?s); this makes . match any character, including \n.
- Ad-hoc workaround: use [\s\S] instead of .
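The difference is easy to see on a minimal input. The following is only a sketch, assuming GNU grep with -P support; the two-line string is made up for illustration.

```shell
# Two-line input chosen for illustration only.
input='a
b'

# Default: . does not match the newline, so a.b finds nothing.
# grep exits with status 1 when there is no match, hence the "|| true".
plain=$(printf '%s' "$input" | grep -zPo 'a.b' | tr '\0' '\n') || true

# Choice 1: (?s) makes . match the newline as well.
dotall=$(printf '%s' "$input" | grep -zPo '(?s)a.b' | tr '\0' '\n')

# Choice 2: [\s\S] matches any character, whether or not the s option is set.
adhoc=$(printf '%s' "$input" | grep -zPo 'a[\s\S]b' | tr '\0' '\n')

printf 'plain:  [%s]\n' "$plain"
printf 'dotall: [%s]\n' "$dotall"
printf 'adhoc:  [%s]\n' "$adhoc"
```

Only the first variant should come back empty; the other two return both lines.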
As an aside: the \K construct is a syntactically simpler and sometimes more flexible alternative to a lookbehind assertion ((?<=...)).
- Your command had both, which did no harm in this case, but was unnecessary.
- By contrast, had you tried (?<=>\s*<img alt=") for more flexible whitespace matching (note the \s* in place of the original single space), your lookbehind assertion would have failed, because lookbehind assertions must be of fixed length (at least as of GNU grep v2.26).
However, using just \K would have worked: >\s*<img alt="\K
\K simply drops everything matched so far from the reported match (it is not included in the output).
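To compare the two approaches side by side, here is a hedged sketch with a made-up input line that has two spaces between > and <img. It assumes GNU grep with -P support; the exact error message for the lookbehind depends on the PCRE version grep was built against, so only the exit status is captured.

```shell
# Made-up input with two spaces (instead of one) between > and <img.
input='<td>  <img alt="Some Movie" src="http://example.com/poster.jpg">'

# Lookbehind containing \s*: expected to be rejected as a pattern error.
# GNU grep signals errors with an exit status greater than 1.
lookbehind_status=0
printf '%s\n' "$input" | grep -Po '(?<=>\s*<img alt=")[^"]+' 2>/dev/null || lookbehind_status=$?

# \K has no length restriction: the text before it must match,
# but it is left out of what -o prints.
with_k=$(printf '%s\n' "$input" | grep -Po '>\s*<img alt="\K[^"]+')

printf 'lookbehind exit status: %s\n' "$lookbehind_status"
printf 'result with K reset:    %s\n' "$with_k"
```

The \K variant should print just the alt text, while the lookbehind variant produces no output at all.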