Let's first clarify a couple of misunderstandings.
> I'm learning about Bash scripting, and need some help understanding regex's.
You seem to be implying some sort of relation between Bash and regex, as if Bash were some sort of regex engine. It isn't. The [[ builtin, with its =~ operator, is the only thing I recall in Bash that supports regular expressions, but I think you mean something else.
There are some common commands executed in Bash that support some implementation of regular expressions, such as grep, sed, and others. Maybe that's what you meant. It's good to be specific and accurate.
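For completeness, here is a minimal sketch of what regex matching with the [[ builtin looks like in Bash itself. The sample link string and the variable names (link, pattern, photo_id) are assumptions made up for illustration, and it needs Bash rather than a plain POSIX sh:

```shell
#!/usr/bin/env bash
# Hypothetical sample input, mirroring the link format from the question.
link='./download/file.php?id=123456&mode=view'

# Keeping the pattern in a variable avoids quoting surprises on the right side of =~.
pattern='id=([0-9]+)&mode=view'

if [[ $link =~ $pattern ]]; then
    # BASH_REMATCH[0] holds the whole match, BASH_REMATCH[1] the first capture group.
    photo_id=${BASH_REMATCH[1]}
    echo "$photo_id"
fi
```

This works on one string at a time, so for scanning a whole file a tool such as grep or sed is usually the more natural fit.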
> I want to get the id's of all linked photos in this page. I just need help figuring out what the RegEx should be.
This suggests an underlying assumption that if you want to extract content from HTML, then regex is the way to go. That assumption is incorrect.
Although it's best to extract content from HTML using an XML parser (using one of the suggestions in Gilles' answer), and reaching for regex is not a good reflex, for simple cases like yours it might just be good enough:
grep -oP '\./download/file\.php\?id=\K\d+(?=&mode=view)' file.html
Take note that you escaped the wrong characters in the regex:
- / and & don't have a special meaning and don't need to be escaped
- . and ? have special meaning and need to be escaped
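To see why the escaping matters, here is a small sketch. The two input lines are made up for illustration, and I use grep -E here only because it is widely available; the same point holds for Perl-style regexes:

```shell
# Hypothetical input: line 1 is a real link fragment, line 2 is a decoy.
sample='file.php?id=1
fileXphpid=2'

# Unescaped: "." matches any character and "?" makes the preceding "p" optional,
# so the pattern misses the real link and matches the decoy instead.
unescaped=$(printf '%s\n' "$sample" | grep -E 'file.php?id=')

# Escaped: "\." and "\?" match the literal characters.
escaped=$(printf '%s\n' "$sample" | grep -E 'file\.php\?id=')

echo "unescaped: $unescaped"
echo "escaped:   $escaped"
```

The unescaped pattern never matches the real link at all, because nothing in it can consume the literal question mark.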
Some extra tricks in the above regex are worth explaining:
- The -P flag of grep enables Perl-style (powerful) regular expressions
- The \K is a Perl-specific symbol; it means to not include in the match the content before the \K
- The (?=...) is a zero-width positive lookahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in the match.
- The \K and the lookahead trickery is there to work with grep -o, which outputs only the matched part. Without these tricks, the matched part would be, for example, ./download/file.php?id=123456&mode=view, which is more than what you want.
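Here is a quick sketch of the difference, assuming a grep built with -P support (GNU grep typically is). The one-line HTML snippet and the variable names are made up for illustration and stand in for your file.html:

```shell
# Hypothetical one-line stand-in for file.html.
html='<a href="./download/file.php?id=123456&mode=view">photo</a>'

# Without \K and the lookahead, -o prints the whole matched link text.
full=$(printf '%s\n' "$html" | grep -oP '\./download/file\.php\?id=\d+&mode=view')

# With \K and (?=...), only the digits remain in the reported match.
id_only=$(printf '%s\n' "$html" | grep -oP '\./download/file\.php\?id=\K\d+(?=&mode=view)')

echo "$full"
echo "$id_only"
```

The first command prints the full ./download/file.php?id=123456&mode=view string, while the second prints just 123456.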