grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | sed -e 's/^.*"\([^"]\+\)".*$/\1/g'

After trawling the internet for an answer to my homework question, I finally found the command above. But I don't completely understand the two regular expressions used with sed and grep. Can somebody please shed some light on them? Thanks in advance.

  • If your teacher is telling you to parse HTML with regexes, you really need to put a thumbtack on their chair – Marc B Apr 03 '14 at 20:17
  • Is your question about `grep`, about `sed`, or about the regular expression? Paste the regex into regex101.com and it will give you a nice explanation. – Floris Apr 03 '14 at 20:17
  • As @MarcB says - html and regex are not friends. Use a real parser. My favorite is BeautifulSoup (python) but there are many others. – Floris Apr 03 '14 at 20:18
  • Why is it off-topic? Yes, it's about my homework, but that doesn't mean I'm not allowed to seek help on this website. I don't understand why the expressions are constructed that way, so I posted a question. I don't see anything wrong with this. – FoolishBeat Apr 03 '14 at 20:18
  • Why don't you write down what you _do_ understand; the chances of someone being willing to give you "the bits you are missing" will be much better. – Floris Apr 03 '14 at 20:19
  • Trying to parse out something semi-regular like a url is an excellent way to get people thinking about regular expressions. It just is painfully inefficient and impractical. – omgz0r Apr 03 '14 at 20:19
  • @Floris: I just don't understand what the expressions mean – FoolishBeat Apr 03 '14 at 20:19
  • You do know some regex, don't you? Otherwise how can this be homework... As an aside, here is the [obligatory link](http://stackoverflow.com/a/1732454/1967396) - somebody had to do it... – Floris Apr 03 '14 at 20:20
  • I know some simple ones. `\([^"]\+\)` <<-- what does this mean? – FoolishBeat Apr 03 '14 at 20:23
  • check this: http://bit.ly/1gSBTx1 – clt60 Apr 03 '14 at 20:26
  • @MarcB: The missing part is: wget -O - http://google.com – FoolishBeat Apr 03 '14 at 20:26

1 Answer


The grep command looks for any lines that include a match to

'<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"'

which is

<a     the characters <a
[^>]   any single character that is not a closing '>'
\+     the last thing one or more times; together, [^>]\+ skips whatever sits
       between '<a' and 'href' without leaving the tag. For a plain '<a href' one
       character would do, but '<a class="x" href' needs the \+
href   followed by the string 'href'
[ ]*   followed by zero or more spaces (you don't really need the [], just ' *' would be enough)
=      followed by the equals sign
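[ \t]* meant as zero or more spaces or tabs ("white space"); note that inside []
       grep takes \t literally (a backslash or the letter t), so [[:blank:]]* is
       the safer way to write "space or tab"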
"      followed by open quote (but only a double quote...)
\(     open bracket (grouping)
ht     characters 'ht'
\|     or
f      character f
\)     close group (of the either-or)
tp     characters 'tp'
s\?    optionally followed by s
       Note - the last few lines combined means 'http or https or ftp or ftps'
:      character :
[^"]\+ one or more characters that are not a double quote
       this is "everything until the next quote"
"      followed by the closing double quote
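
To see the pieces above working together, here is a small sketch that runs the same grep over a few made-up lines. The sample HTML and the variable names `html` and `matches` are my own, not part of the question, and I am assuming GNU grep, since `\+`, `\|` and `\?` are GNU extensions to the basic syntax.

```shell
# Hypothetical sample input; the tags and URLs are made up for illustration.
html='<p>He said "hello"</p>
<a href="http://example.com/a">A</a>
<a class="x" href = "https://example.com/b">B</a>
<a href="mailto:someone@example.com">mail</a>
<a href="ftp://example.com/c">C</a>'

# -i: ignore case; -o: print only the matched text, one match per line.
matches=$(printf '%s\n' "$html" |
  grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"')

printf '%s\n' "$matches"
```

Only the three http/https/ftp links should come out, each cut off right after the closing quote of the URL; the `mailto:` link and the line without an `<a` tag are skipped.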

Does that get you started? You can do the same for the next bit...
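
If it helps as a starting point for "the next bit", here is a sketch of the sed expression taken apart the same way and tried on two made-up lines. The variable names `extract`, `url` and `plain` are mine, and I am assuming GNU sed because of the `\+`.

```shell
# GNU sed assumed: \+ is a GNU extension in basic regular expressions.
# Pattern pieces:
#   ^.*"        from the start of the line, as much as possible, up to a double quote
#   \([^"]\+\)  group 1: one or more characters that are not a double quote
#   ".*$        the closing quote and everything after it
#   \1          the replacement: only what group 1 captured
extract='s/^.*"\([^"]\+\)".*$/\1/g'

# Hypothetical line, shaped like one emitted by the grep -o stage.
url=$(printf '%s\n' '<a class="x" href = "https://example.com/b"' | sed -e "$extract")

# A line that has nothing to do with links is rewritten as well.
plain=$(printf '%s\n' 'He said "hello"' | sed -e "$extract")

printf '%s\n%s\n' "$url" "$plain"
```

This should print `https://example.com/b` and then `hello`. Because the pattern is anchored with `^` and `$` it can match at most once per line, so the trailing `g` does not change anything here.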

One more note, just to confuse you: the backslash is used to change the meaning of some special characters like ()+. To keep everyone on their toes, whether these have their special meaning with or without the backslash is not defined by regular expression syntax as such, but by the command in which you use it (and its options). For example, sed changes the meaning of things depending on whether you use the -E flag.
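
As a rough illustration of that last point, the sketch below writes the same substitution twice, once in sed's default (basic) syntax and once with -E. The sample line and the names `line`, `bre` and `ere` are made up, and the basic form again assumes GNU sed for `\+`.

```shell
# Hypothetical sample line.
line='<a href="https://example.com/b"'

# Default (basic) syntax: grouping and "one or more" need a backslash.
bre=$(printf '%s\n' "$line" | sed -e 's/^.*"\([^"]\+\)".*$/\1/')

# Extended syntax (-E): ( ) and + are special without a backslash.
ere=$(printf '%s\n' "$line" | sed -E -e 's/^.*"([^"]+)".*$/\1/')

printf '%s\n%s\n' "$bre" "$ere"
```

Both commands should print the same URL; only the spelling of the expression differs.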

Floris
  • @FoolishBeat you are welcome. From the comments you can see this was not really "a good question" by the standards of SO. But I could tell from the comments what you were really after - so it could be salvaged. Do look around to get a feeling for the kinds of things that people typically ask - and specifically, when you have a homework question, show you have made an effort. If you had expanded your question with "I understand _these_ bits mean _xyz_, but I am struggling to understand this part", rather than "here's something I found on Google. Help me." you would not have gotten a down vote. – Floris Apr 03 '14 at 20:43
  • Yeah, that's why people seemed to help me build a house with a lot of bricks. I learnt a lesson. BTW, the `sed -e 's/^.*"\([^"]\+\)".*$/\1/g'` means: keep the first matching URL on each line and delete whatever is not matched, doesn't it? – FoolishBeat Apr 03 '14 at 20:59
  • @foolishbeat: Actually, it means "keep the last non-empty thing in double quotes on the line and drop everything else" (the leading `.*` is greedy). On its own that need not be a URL; it works in your pipeline only because `grep -o` passes on just the matched text, which ends with the quoted link. A line like `He said "hello"` would return `hello` with your expression. If `sed` had to work on raw lines, you would need an expression closer to the one you used for `grep`. – Floris Apr 03 '14 at 23:27