
It appears that Google search results yield links of the following form when the HTML is parsed with BeautifulSoup:

/url?q=  "URL WOULD BE HERE"    &sa=U&ei=9LFsUbPhN47qqAHSkoGoDQ&ved=0CCoQFjAA&usg=AFQjCNEZ_f4a9Lnb8v2_xH0GLQ_-H0fokw

I am getting the links by using soup.findAll('a') and then using a['href'].

More specifically, the code I have used is the following:

import urllib2
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

main_site = 'https://www.google.com/'
search = 'search?q=' 
query = 'pillows'
full_url = main_site+search+query
request = urllib2.Request(full_url, headers={'User-Agent': 'Chrome/16.0.912.77'})
main_html = urllib2.urlopen(request).read()

results = BeautifulSoup(main_html, parseOnlyThese=SoupStrainer('div', {'id': 'search'}))
try:
    for search_hit in results.findAll('li', {'class':'g'}):
        for elm in search_hit.findAll('h3',{'class':'r'}):
            for a in elm.findAll('a',{'href':re.compile('.+')}):
                print a['href']

except TypeError:
    pass

Also, I have noticed on other sites that a['href'] may return something like /dsoicjsdaoicjsdcj, where the link would take you to website.com/dsoicjsdaoicjsdcj. I know that in this case I can simply concatenate them, but I feel I shouldn't have to change how I parse and handle a['href'] depending on which website I'm looking at. Is there a better way to get this link? Is there some JavaScript that I need to take into account? Surely there is a simple way in BeautifulSoup to get the full URL to follow from an a tag?

chase

2 Answers

SoupStrainer('div', {'class': "vsc"})

returns nothing because, when you do:

print main_html

and search the output for "vsc", there is no match.
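
The same check can be made a little more systematic by listing every class and id that actually occurs in the downloaded HTML before choosing what to pass to SoupStrainer. The following is a minimal sketch using only the standard library (Python 3); `ClassCollector`, `collect_markers`, and `sample_html` are hypothetical names introduced for illustration, and `sample_html` merely stands in for the real `main_html`.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every class name and id that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.classes = set()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == 'class' and value:
                self.classes.update(value.split())
            elif name == 'id' and value:
                self.ids.add(value)


def collect_markers(html):
    """Return (set of class names, set of ids) found in html."""
    collector = ClassCollector()
    collector.feed(html)
    return collector.classes, collector.ids


# Stand-in for the downloaded page; replace with the real main_html.
sample_html = ('<div id="search"><li class="g">'
               '<h3 class="r"><a href="/x">x</a></h3></li></div>')
classes, ids = collect_markers(sample_html)
print('vsc' in classes)   # False: no element carries that class
print('search' in ids)    # True: this id exists and can be targeted
```

If the class or id you filter on is missing from these sets, the strainer has nothing to match and the resulting soup is empty.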

nnaelle

You're looking for this:

# container with needed data: title, link, etc.
for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf a')['href']
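
If the hrefs still arrive in the wrapped /url?q=... form shown in the question, or as site-relative paths such as /dsoicjsdaoicjsdcj, they can be normalised with the standard library rather than by manual concatenation. Below is a minimal sketch (Python 3); `resolve_href` is a hypothetical helper, and it assumes that any href whose path is exactly /url carries the real target in its q parameter, which may not hold for every site.

```python
from urllib.parse import parse_qs, urljoin, urlsplit


def resolve_href(href, page_url):
    """Return an absolute URL for an href scraped from page_url.

    Hypothetical helper: unwraps '/url?q=<target>&sa=...' redirect links
    and resolves relative paths against the page they were found on.
    """
    parts = urlsplit(href)
    if parts.path == '/url':
        # parse_qs percent-decodes values and returns a list per key
        target = parse_qs(parts.query).get('q')
        if target:
            return target[0]
    # urljoin leaves absolute URLs unchanged and resolves relative ones
    return urljoin(page_url, href)


print(resolve_href('/url?q=https://example.com/pillows&sa=U&ved=0CCoQFjAA',
                   'https://www.google.com/search?q=pillows'))
# https://example.com/pillows

print(resolve_href('/dsoicjsdaoicjsdcj', 'https://website.com/index.html'))
# https://website.com/dsoicjsdaoicjsdcj
```

Passing the URL of the page that was fetched as `page_url` keeps the logic the same for every site, so the parsing code does not need per-site special cases.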

Also, when using the requests library, you can pass URL parameters easily like so:

# this:
main_site = 'https://www.google.com/'
search = 'search?q=' 
query = 'pillows'
full_url = main_site+search+query

# could be translated to this:
import requests

params = {
  'q': 'pillows',  # same query as above
  'gl': 'us',      # country to search from
  'hl': 'en',      # interface language
}
html = requests.get('https://www.google.com/search', params=params)

With urllib you can do it like so (in Python 3, urlencode lives in urllib.parse; in Python 2 it was urllib.urlencode):

# https://stackoverflow.com/a/54050957/15164646
# https://stackoverflow.com/a/2506425/15164646
import urllib.parse
import urllib.request

url = "https://disc.gsfc.nasa.gov/SSW/#keywords="
params = {'keyword':"(GPM_3IMERGHHE)", 't1':"2019-01-02", 't2':"2019-01-03", 'bboxBbox':"3.52,32.34,16.88,42.89"}

# percent-encode the values and join the pairs with '&'
# (Python 3.7+ keeps the dict's insertion order)
quoted_params = urllib.parse.urlencode(params)
# 'keyword=%28GPM_3IMERGHHE%29&t1=2019-01-02&t2=2019-01-03&bboxBbox=3.52%2C32.34%2C16.88%2C42.89'

full_url = url + quoted_params
# 'https://disc.gsfc.nasa.gov/SSW/#keywords=keyword=%28GPM_3IMERGHHE%29&t1=2019-01-02&t2=2019-01-03&bboxBbox=3.52%2C32.34%2C16.88%2C42.89'

# in Python 3, urlopen lives in urllib.request
resp = urllib.request.urlopen(full_url).read()

Full code example:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

params = {
  'q': 'minecraft',
  'gl': 'us',
  'hl': 'en',
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf a')['href']
  print(link)

---------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''

Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to build everything from scratch, work around blocks, or maintain the parser over time.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "minecraft",
  "hl": "en",
  "gl": "us",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  print(result['link'])

---------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''

Disclaimer: I work for SerpApi.

Dmitriy Zub