I'm trying to write a small scraper script for Google search. I've written the program, but I have one small problem: I need a regex to extract the data-href value from the Google search results. Please help me.

Example HTML from a Google search:

data-href="www.buxmob.net/index.php?id=577">
data-href="www.webopedia.com/TERM/K/keyword.html">
data-href="moz.com/beginners-guide-to-seo/keyword-research">

I need only the URL contained in this value, like this:

hxxp://www.webopedia.com/TERM/K/keyword.html
hxxp://moz.com/beginners-guide-to-seo/keyword-research
hxxp://www.buxmob.net/index.php?id=577

Thank you.

pythoncoder

1 Answer

All the examples you gave can be matched with

(?:data-href=")(.*?)(?:">)

See demo at http://regex101.com/r/rB4nS1

That does NOT mean it's a good idea to try to parse (general) HTML with regex, but sometimes, when the response is well formed and well known, you can get away with it.

Note that you mentioned you wanted hxxp:// in front of the string. That is not the job of the regular expression; it belongs in the language you use to apply the expression. The expression above is a non-greedy match that starts after the string data-href=" and ends at the next "> sequence.
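
As a rough sketch of how the two steps could fit together, assuming Python is the language in use (the question does not say): the expression does the matching, and ordinary string code adds the prefix. The function name extract_urls and its scheme parameter are made up for illustration, and "http://" is only a guess at what hxxp:// stands for.

```python
import re

# Pattern from the answer: capture everything between data-href=" and the next ">
DATA_HREF = re.compile(r'(?:data-href=")(.*?)(?:">)')


def extract_urls(html, scheme="http://"):
    """Return every data-href value, with a scheme prepended when it is missing.

    `scheme` is an assumption; pass whatever prefix you actually need.
    """
    urls = []
    for value in DATA_HREF.findall(html):
        # The prefix is added here, in the host language, not by the regex.
        if not value.startswith(("http://", "https://")):
            value = scheme + value
        urls.append(value)
    return urls


# Sample input copied from the question.
sample = '''data-href="www.buxmob.net/index.php?id=577">
data-href="www.webopedia.com/TERM/K/keyword.html">
data-href="moz.com/beginners-guide-to-seo/keyword-research">'''

for url in extract_urls(sample):
    print(url)
```

With one capturing group in the pattern, re.findall returns just the captured values, so no extra slicing is needed. This still carries the caveat above: it only works while the markup keeps this exact shape.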

Floris