
I am updating a script that parses Google search results. Google changed the way the results are returned, and I need to update my regex. The current problem is getting the regex to stop at the first ampersand.

Current regex:

re_urls = re.compile(r'<a href="/url\?q=(.*?)"')

This returns, for example: http://www.example.com/test&amp;sa=U&amp;ei=3gdhVOfSJOr1iQKnwoBg&amp;ved=0CBQQFjAA&amp;usg=AFQjCNHPaPBdpjIJFynGKhW1As1fg9r8Aw

How do I get it to return just http://www.example.com/test?

Siggy
  • It is not recommended to use regexes for this. Check [this](http://stackoverflow.com/a/1732454/1224076) out. Try [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) instead. – Phani Nov 10 '14 at 18:52
  • Don't use regex to parse HTML. – Padraic Cunningham Nov 10 '14 at 18:57
  • It's Google search results, and I don't want to use BeautifulSoup. I am trying to make the script as modular as possible. – Siggy Nov 10 '14 at 18:57
  • @Phani: I find for fixed machine-generated HTML that regexes can often be a good solution. – nneonneo Nov 10 '14 at 18:58
  • @Siggy, there is a reason why people recommend not using regex to parse html and I don't get the modular part – Padraic Cunningham Nov 10 '14 at 19:00
  • @Siggy: parsing HTML is hardly modular. You have observed once already that any change in the format of the response wrecks your script. Instead, there is an API for that. – njzk2 Nov 10 '14 at 19:00
  • @PadraicCunningham modular as in: aside from Python, nothing else has to be installed for the script to work. – Siggy Nov 10 '14 at 19:10
  • @Siggy, until your regex breaks again ;) – Padraic Cunningham Nov 10 '14 at 19:12
  • @Siggy: That's almost the exact opposite meaning as "modular". – abarnert Nov 10 '14 at 20:13
  • Also, are you aware that it's generally against Google's Terms of Service to scrape their web pages when they provide APIs to access the same information? (That may even be part of the reason they periodically change the output format, although as far as I know they've never confirmed that.) If you're sure that there are no legal issues for you, or that you just don't care, that's fine, but make sure you're doing it knowingly. – abarnert Nov 10 '14 at 20:15

1 Answer


If you aren't interested in anything from the first ampersand onward, you can simply use

r'<a href="/url\?q=([^&"]*)'

This uses a negated character class, [^&"], which matches any character except " and &. Because the class is repeated greedily, the capture group takes everything up to the first & or " and stops there.
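As a quick check, here is a minimal sketch that runs the pattern against a made-up snippet. The `html` string is an assumption shaped like the output shown in the question, not real Google markup, and the name `urls` is only illustrative.

```python
import re

# Pattern from this answer: capture everything up to the first & or ".
re_urls = re.compile(r'<a href="/url\?q=([^&"]*)')

# Hypothetical markup modeled on the example in the question.
html = (
    '<a href="/url?q=http://www.example.com/test&amp;sa=U&amp;ved=0CBQQFjAA">Test</a>'
    '<a href="/url?q=http://www.example.org/&amp;sa=U">Other</a>'
)

# findall returns the contents of the single capture group for each match.
urls = re_urls.findall(html)
print(urls)  # ['http://www.example.com/test', 'http://www.example.org/']
```

Note that a target URL containing a literal & of its own would be cut short at that point, since the pattern cannot tell it apart from the tracking parameters.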

nneonneo