0

I have a line of output, already extracted with a regex, that looks like this:

<a href="google.com">"test link"</a><br>

How do I capture google.com, without the quotes, into a variable? The URL could contain many '/' characters, e.g. (made-up gibberish below):

http://www.google.com/search/something/lulz/here2;i=!mfo1iu489fn1o2jlk21m4098mdoi

EDIT: I would want the entire url string and not just www.google.com in the above case.

Note: I don't want to download third-party libraries etc. in order to perform this action.

Anthony Miller
  • I accept downvotes without comments as hugs. <3 – Anthony Miller Apr 19 '13 at 19:58
  • I am not the downvoter, but I am guessing because this is yet **another** request with help parsing html with regex. See stackoverflow's [most upvoted answer](http://stackoverflow.com/a/1732454/1032785) – jordanm Apr 19 '13 at 20:11
  • And you would be wrong in that assumption as I asked for any native bash command. The only mention of regex is the fact that I already stripped the href line using regex from the html... but I wasn't asking for someone to use regex to parse the field data I need. I already know it's not possible seeing as there is no 'non-capturing group' available for regex. (not attacking you, just explaining in case that is the reason) – Anthony Miller Apr 19 '13 at 20:13
  • grep and cut are not native bash commands. See the `SHELL BUILTIN COMMANDS` section of the manpage for a full list. – jordanm Apr 19 '13 at 20:15
  • i thought since those commands already worked in bash on my install (centos5) I figured it was native... my mistake. EDIT: Edited the note portion of the question and the tags – Anthony Miller Apr 19 '13 at 20:18
  • @Jordanm, sry but this question is about parsing __SOME_KNOWN_CHARS="wantedchars"OTHER_KNOWN_CHARSrandom_charsEND_CHARS. So, it is not about the parsing HTML... This is special case what CAN be done easily without full-blown html-parser.. right? – clt60 Apr 19 '13 at 20:31
  • "I've finally figured out an easy way to parse HTML with regex." -- Fermat's second-to-last theorem. – Don Branson Apr 19 '13 at 20:33
  • @jordanm +1 - very cool. Thanks for the pointer. Everyone should go read that question and its entertaining answers. – Don Branson Apr 19 '13 at 20:37
  • @DonBranson "Chuck Norris can parse HTML with regex" I loled... Chuck Norris jokes still pops a grin on my face XD – Anthony Miller Apr 19 '13 at 21:33
  • @Mechaflash - Agreed. But I guess no one appreciates a good Fermat joke anymore. – Don Branson Apr 19 '13 at 21:58
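
On the builtin question raised in the comments above, bash itself can report how it resolves a name. A small sketch using the `type -t` builtin; what it prints for `grep` depends on the system, so that line is only described, not guaranteed.

```bash
#!/usr/bin/env bash
# type -t prints one word describing how bash would resolve a name:
# alias, keyword, function, builtin, or file.
type -t shopt    # builtin
type -t '[['     # keyword
# grep is normally an external program, in which case this prints "file".
type -t grep || echo "grep not found"
```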

2 Answers

3

Try this pure-bash regex solution:

shopt -s nocasematch    # Ignore character case when matching
text='<a href="hTTtp://www.google.com/search/something/lulz/here2;i=!mfo1iu489fn1o2jlk21m4098mdoi">"test link"</a><br>'
regex='(<a\ +href=\")([^\"]+)(\">)'    # Group 2 is the text between the double quotes
[[ $text =~ $regex ]] && echo "${BASH_REMATCH[2]}"
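
To store the match in a variable instead of printing it, which is what the question asks for, the captured group can be assigned directly. This is only a sketch: the names `line` and `url` are made up for illustration, and the regex is a trimmed variant of the one above that keeps a single capture group.

```bash
#!/usr/bin/env bash
shopt -s nocasematch    # also match HREF, Href, ...

line='<a href="http://www.google.com/search/something/lulz/here2;i=!mfo1iu489fn1o2jlk21m4098mdoi">"test link"</a><br>'
regex='<a +href="([^"]+)"'    # group 1 = everything between the double quotes

url=''
if [[ $line =~ $regex ]]; then
    url=${BASH_REMATCH[1]}    # plain assignment, no external command involved
fi
echo "$url"
```

If the line contains no double-quoted href, `url` simply stays empty, so the caller can test it with `[[ -n $url ]]`.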
clt60
  • :) you're welcome - the script has some limitations. the url must be between double quotes, and it can't contain double quote as a valid char. As @jordan told - it is not OK parsing HTML with regexes... :) – clt60 Apr 19 '13 at 21:13
  • which by standard usage this should be the case and should work for my needs. – Anthony Miller Apr 19 '13 at 21:31
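
The double-quote limitation mentioned in the comments above can be relaxed a little. The following is a sketch, not part of the original answer, that also accepts a single-quoted href. `extract_href`, `dq` and `sq` are hypothetical names, and unquoted attribute values are still not handled.

```bash
#!/usr/bin/env bash
shopt -s nocasematch

dq='"'
sq="'"
# Alternation: group 2 holds a double-quoted value, group 3 a single-quoted one.
regex="<a +href=(${dq}([^${dq}]+)${dq}|${sq}([^${sq}]+)${sq})"

# Hypothetical helper: prints the href value found in $1, returns 1 if none.
extract_href() {
    [[ $1 =~ $regex ]] || return 1
    # Only one of the two groups matched; the other expands to nothing.
    printf '%s\n' "${BASH_REMATCH[2]}${BASH_REMATCH[3]}"
}

extract_href '<a href="google.com">"test link"</a><br>'
extract_href "<a href='http://www.google.com/search'>x</a>"
```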
2
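shopt -s nocasematch

TEXT='<a href="http://www.google.com/search/something/lulz/here2;i=!mfo1iu489fn1o2jlk21m4098mdoi">"test link"</a><br>'

TEXT=${TEXT##*href=\"}    # Remove everything up to and including href="
TEXT=${TEXT%%\"*}         # Remove the closing quote and everything after it; TEXT is now the full href value
TEXT=${TEXT##*//}         # Remove everything up to and including the last //
TEXT=${TEXT%%/*}          # Remove the first / and everything after it, leaving the host name

echo "$TEXT"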
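
Since the question's edit asks for the whole URL rather than just the host, here is a sketch that keeps only the first two expansions. `URL` is a name chosen for illustration, and the input is assumed to contain exactly one double-quoted href.

```bash
#!/usr/bin/env bash
TEXT='<a href="http://www.google.com/search/something/lulz/here2;i=!mfo1iu489fn1o2jlk21m4098mdoi">"test link"</a><br>'

URL=${TEXT##*href=\"}   # drop everything up to and including href="
URL=${URL%%\"*}         # drop the closing quote and everything after it

echo "$URL"
```

Both expansions leave the string unchanged when their pattern does not match, so a line without `href="` would not yield a URL here; checking the input first is left to the caller.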
Ziffusion
  • I would still need the entirety of the href value in the case of a url with multiple depths '/' – Anthony Miller Apr 19 '13 at 20:22
  • Not sure what you mean. You wish to extract the URL as well as the hostname? – Ziffusion Apr 19 '13 at 20:56
  • meant to put 'without multiple depths'. such as if the url was just simply google.com. I noticed your code relies on '/' existing in the url. – Anthony Miller Apr 19 '13 at 21:00
  • @Mechaflash Updated. Note that this is very fast because it does not invoke any external commands. This functionality is built into bash itself. – Ziffusion Apr 19 '13 at 21:15