The grep command looks for any lines that include a match to
'<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"'
which breaks down as follows:
<a the characters <a
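[^>] followed by any single character that is not a closing '>'
\+ the previous item one or more times. Without the \+, exactly one non-'>' character would be allowed between '<a' and 'href'; that is enough for '<a href', but it would miss tags that have other attributes before href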
href followed by the string 'href'
[ ]* followed by zero or more spaces (you don't really need the [], just ' *' would be enough)
= followed by the equals sign
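[ \t]* followed by zero or more spaces or tabs ("white space"). Careful: without -P, grep takes a backslash inside [] literally, so this really matches a space, a backslash or the letter t; [[:blank:]]* is the portable way to say "space or tab"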
" followed by open quote (but only a double quote...)
\( open bracket (grouping)
ht characters 'ht'
\| or
f character f
\) close group (of the either-or)
tp characters 'tp'
s\? optionally followed by s
Note - the last few lines combined mean 'http or https or ftp or ftps'
: character :
[^"]\+ one or more characters that are not a double quote
this is "everything until the next quote"
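To see the pieces working together, here is a small sketch that runs the pattern against a few made-up lines. It assumes GNU grep in its default (basic) mode; the sample tags and the names pattern, no_plus and sample_lines are invented for illustration. It also compares against a variant without the \+ after [^>].

```shell
# Assumes GNU grep in its default (basic) regex mode.
pattern='<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"'
# Same pattern with the \+ after [^>] removed, for comparison.
no_plus='<a[^>]href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"'

# Made-up sample input, one anchor tag per line.
sample_lines() {
cat <<'EOF'
<a class="x" href="https://example.com/page">secure</a>
<a href = "ftp://example.com/file.txt">file</a>
<a href="mailto:someone@example.com">mail</a>
<a href='http://example.com/'>single quotes</a>
EOF
}

# -o prints only the matched part of each matching line.
matches=$(sample_lines | grep -o "$pattern")
printf '%s\n' "$matches"

# -c counts the lines that contain a match.
full_count=$(sample_lines | grep -c "$pattern")
short_count=$(sample_lines | grep -c "$no_plus")
printf 'full pattern: %s line(s), without the plus: %s line(s)\n' "$full_count" "$short_count"
```

The first two sample lines match. The mailto: link fails at the 'http/https/ftp/ftps' step, and the last line fails because it uses single quotes. Without the \+, only the plain '<a href' line still matches.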
Does that get you started? You can do the same for the next bit...
Note, to confuse you: the backslash is used to change the meaning of some special characters like ( ) + ? and |. Just to keep everyone on their toes, whether these have their special meaning with or without the backslash is not defined by regular expressions as such, but by the command in which you use them (and its options). For example, in plain grep or sed, \( and \) group while ( and ) are literal parentheses; with the -E flag it is the other way round.
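As a quick illustration of that difference, the sketch below writes the URL part of the pattern twice: once in the default (basic) syntax and once for grep -E. It assumes GNU grep; the sample line and the variable names are made up.

```shell
# Made-up sample line.
line='<a href="http://example.com/">link</a>'

# Basic syntax (grep default): grouping, alternation and repetition
# are written with backslashes.
bre=$(printf '%s\n' "$line" | grep -o '"\(ht\|f\)tps\?:[^"]\+"')

# Extended syntax (grep -E): the same operators without backslashes.
ere=$(printf '%s\n' "$line" | grep -oE '"(ht|f)tps?:[^"]+"')

printf 'basic:    %s\nextended: %s\n' "$bre" "$ere"
```

Both commands print the same quoted URL; only the spelling of the operators changes.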