When I parse a Google search results page with BeautifulSoup, each result link comes back in the following form:
/url?q= "URL WOULD BE HERE" &sa=U&ei=9LFsUbPhN47qqAHSkoGoDQ&ved=0CCoQFjAA&usg=AFQjCNEZ_f4a9Lnb8v2_xH0GLQ_-H0fokw
I am getting the links by calling soup.findAll('a') and then reading a['href'] on each tag.
More specifically, the code I have used is the following:
import re
import urllib2

from BeautifulSoup import BeautifulSoup, SoupStrainer

main_site = 'https://www.google.com/'
search = 'search?q='
query = 'pillows'
full_url = main_site + search + query

# Send a browser-like User-Agent with the request.
request = urllib2.Request(full_url, headers={'User-Agent': 'Chrome/16.0.912.77'})
main_html = urllib2.urlopen(request).read()

# Parse only the <div id="search"> element that holds the results.
results = BeautifulSoup(main_html, parseOnlyThese=SoupStrainer('div', {'id': 'search'}))

try:
    for search_hit in results.findAll('li', {'class': 'g'}):
        for elm in search_hit.findAll('h3', {'class': 'r'}):
            # Match only anchors with a non-empty href.
            for a in elm.findAll('a', {'href': re.compile('.+')}):
                print a['href']
except TypeError:
    pass
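The /url?q=... form printed by this loop looks like a redirect wrapper, with the real target carried in the q query parameter. Here is a minimal sketch of unwrapping it, assuming that reading of the format is right. The helper name unwrap_google_href and the example.com URL are made up for illustration, and the import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs  # Python 2


def unwrap_google_href(href):
    """Return the target of a '/url?q=...' href, or href unchanged.

    Hypothetical helper: it assumes the real link is stored in the
    'q' query parameter of the '/url' path.
    """
    parsed = urlparse(href)
    if parsed.path == '/url':
        # parse_qs maps each parameter name to a list of decoded values.
        targets = parse_qs(parsed.query).get('q')
        if targets:
            return targets[0]
    return href


print(unwrap_google_href('/url?q=http://example.com/pillows&sa=U&ved=0CCoQFjAA'))
# http://example.com/pillows
```

Any href that does not match the /url?q= pattern is passed through untouched, so the same call could be applied to every a['href'] in the loop.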
Also, I have noticed on other sites that a['href'] may return something like /dsoicjsdaoicjsdcj, where the link would actually take you to website.com/dsoicjsdaoicjsdcj.
I know that in this case I can simply concatenate them, but I feel I shouldn't have to change the way I parse and treat a['href'] based on which website I'm looking at. Is there a better way to get this link? Is there some JavaScript that I need to take into account? Surely there is a simple way in BeautifulSoup to get the full URL to follow from an a tag?
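For the relative-link case, one option that does not depend on the site is the standard library's urljoin, which resolves an href against the URL of the page it was found on. This is a sketch rather than a BeautifulSoup feature. The helper name absolute_href and the page URLs are assumptions for illustration, and the import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def absolute_href(page_url, href):
    """Resolve href against the page it came from (hypothetical helper)."""
    return urljoin(page_url, href)


# A root-relative href is resolved against the page's scheme and host.
print(absolute_href('http://website.com/some/page.html', '/dsoicjsdaoicjsdcj'))
# http://website.com/dsoicjsdaoicjsdcj

# An href that is already absolute comes back unchanged.
print(absolute_href('http://website.com/some/page.html', 'http://other.example/x'))
# http://other.example/x
```

Because absolute hrefs are returned as they are, the same call can be used for every link without checking which kind it is first.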