
I want to search for company information automatically on Google. Please see my code below. I get HTTP Error 403: Forbidden or HTTP Error 404: Not Found.

from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib import parse
import openpyxl


wd = openpyxl.load_workbook('C:/Users/Lee Jung Un/Documents/hopeyouwork.xlsx')  # location of the Excel file
ws = wd.active

def bs(Eng_name):
    url = "https://www.google.co.kr/search?ei=hWEaW-bKEMnb8QWa1IrQDw&q="
    q = parse.quote(Eng_name)
    html = urlopen(url + q)
    bsObj = BeautifulSoup(html, "html.parser")
    # select() returns a list; select_one() returns the first match or None
    twg = bsObj.select_one("div.ifM9O > div:nth-child(2) > div.kp-header > div > div.DI6Ufb > div > div > div.d1rFIf > div.kno-ecr-pt.kno-fb-ctx > span")

    if twg is not None:
        return twg.text
    else:
        return "none"


def companyname():
    for r in ws.rows:
        row_index = r[0].row
        Eng_name = r[1].value
        Kor_name = bs(Eng_name)
        ws.cell(row=row_index, column=1).value = row_index
        ws.cell(row=row_index, column=2).value = Eng_name
        ws.cell(row=row_index, column=3).value = Kor_name
        wd.save("Done.xlsx")
    wd.close()

companyname()
U13-Forward

3 Answers


Try setting the User-Agent HTTP header. The default User-Agent strings of popular programming libraries are banned on many web servers to prevent abuse by bots.
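
With the question's urllib code, that means wrapping the URL in a Request object. A minimal sketch (the User-Agent value is just an example; any common browser string works):

from urllib.request import Request, urlopen

# send a browser-like User-Agent instead of urllib's default "Python-urllib/3.x"
req = Request(url + q, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req)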

Also remember that Google imposes other limits, such as the number of queries you may execute. At some point you may be shown a CAPTCHA, or your IP may even be banned from further queries, if you query too much.

If that's your case, you will probably need to read through their documentation and consider that some features may not be free.

jaboja

You have probably been blocked by Google. See if you can still access the URL from a browser. You need to add a user agent to the headers and a delay between each URL request, and perhaps connect via a proxy if you are blocked for long.

May I suggest using the requests package, which is built on top of urllib3 and gives better flexibility while coding.

Example:

import requests

headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) '
                   'Gecko/20100101 Firefox/61.0'),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}

# proxies are optional

ip = 'blah-blah'
port = 'blah'

proxies = {"http": 'http://' + ip + ':' + port,
           "https": 'http://' + ip + ':' + port}

html = requests.get(url, headers=headers, proxies=proxies)

Add a delay after each request with time.sleep(seconds):

import time

def companyname():
    for r in ws.rows:
        row_index = r[0].row
        Eng_name = r[1].value
        Kor_name = bs(Eng_name)

        #add delay after each crawl
        time.sleep(5) #sleeps for 5 seconds

        ws.cell(row=row_index, column=1).value = row_index
        ws.cell(row=row_index, column=2).value = Eng_name
        ws.cell(row=row_index, column=3).value = Kor_name
        wd.save("Done.xlsx")
    wd.close()
Morse

As others mentioned, it's because no user-agent was specified. The default requests user-agent is python-requests, and urllib sends something similar, so Google blocks the request because it knows it's a bot and not a "real" user visit. Faking a user visit means adding a browser-like User-Agent to the HTTP request headers, which can be done with the requests library and custom headers.

>>> headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'}
>>> params = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.get('https://httpbin.org/get', params=params, headers=headers)

>>> print(r.url)
# https://httpbin.org/get?key2=value2&key1=value1
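
Applied to the question's bs() helper, a minimal sketch (the short CSS selector is an assumption for illustration; Google's markup changes frequently, so it may need adjusting):

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) '
                          'Gecko/20100101 Firefox/61.0')}

def bs(Eng_name):
    # requests URL-encodes query parameters itself, so parse.quote() is unnecessary
    r = requests.get('https://www.google.co.kr/search',
                     params={'q': Eng_name}, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    # hypothetical selector for the knowledge-panel title span
    twg = soup.select_one('span.kno-ecr-pt')
    return twg.text if twg else 'none'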

Alternatively, you can achieve the same thing with the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to deal with such problems or maintain the parser over time, since that's already done for the end user; the only thing that needs to be done is to iterate over the structured JSON and grab the data you want.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",              # search engine to search from
    "q": "what does katana mean",    # query
    "hl": "en",                      # language
    "gl": "us",                      # country to search from
    "api_key": os.getenv("API_KEY"), # API key environment
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Title: {result['title']}\nSummary: {result['snippet']}\nLink: {result['link']}\n")

---------
'''
Title: Katana - Wikipedia
Summary: Pronounced [katana], the kun'yomi (Japanese reading) of the kanji 刀, originally meaning dao or knife/saber in Chinese, the word has been adopted as a loanword ...
Link: https://en.wikipedia.org/wiki/Katana

Title: Definition of "katana" - Merriam-Webster
Summary: Katana definition is - a single-edged sword that is the longer of a pair worn by the Japanese samurai.
Link: https://www.merriam-webster.com/dictionary/katana
# other results ...
'''

Disclaimer: I work for SerpApi.

Dmitriy Zub