
I want to crawl data from this site, but something is wrong in my code.

I want to find out why the `find` call goes wrong. I searched on Stack Overflow, but I can't find what's wrong in this code:

```python
from bs4 import BeautifulSoup
from pymongo import MongoClient
import requests
from matplotlib import font_manager, rc

client = MongoClient("localhost", 27017)
database = client.datadb
collection = database.datacol

page = requests.get("https://www.worlddata.info/average-income.php")

soup = BeautifulSoup(page.content, 'html.parser')

general_list = soup.find("tr")

#list_of_tr = general_list.find("tr")

for in_each_tr in general_list:
    list_of_td0 = general_list.find_all("td")[0]
    list_of_td1 = general_list.find_all("td")[1]
    general_list = collection.insert_one({"country":list_of_td0.get_text(), "income":list_of_td1.get_text()})
```


```
Traceback (most recent call last):
  File "C:/Users/SAMSUNG/PycharmProjects/simple/data.py", line 18, in <module>
    for in_each_tr in general_list:
TypeError: 'NoneType' object is not iterable
```
이명철
  • What is the error you are getting? Update the same in the question. – shaik moeed May 24 '19 at 05:59
  • Maybe this just happens to me, but `requests.get("https://www.worlddata.info/average-income.php")` gives me the response [403](https://en.wikipedia.org/wiki/HTTP_403), meaning access to the site is forbidden. – funie200 May 24 '19 at 06:18

3 Answers


Your `general_list` is `None`.

You need to validate an object before performing actions on it.

I'm assuming this address is returning a forbidden error, hence the response contains no `<tr>`s.
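For example, a minimal sketch of that kind of validation (reusing the imports and URL from the question) might look like:

```python
from bs4 import BeautifulSoup
import requests

page = requests.get("https://www.worlddata.info/average-income.php")

# Guard against a failed request (e.g. 403 Forbidden) before parsing
if page.status_code != 200:
    raise SystemExit(f"Request failed with status {page.status_code}")

soup = BeautifulSoup(page.content, 'html.parser')
general_list = soup.find("tr")

# find() returns None when no matching tag exists, so guard before iterating
if general_list is None:
    raise SystemExit("No <tr> found in the page")
```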

If you change the address to:

```python
page = requests.get("https://www.google.com")

soup = BeautifulSoup(page.content, 'html.parser')

general_list = soup.find("tr")

for tr in general_list:
    print(tr)
```

it works.

Tom Slabbaert

The website loads its data via an AJAX request, so you need to use Selenium to download the dynamic content.

First, install the Selenium web driver for your browser.

Import the Selenium web driver:

```python
from selenium import webdriver
```

Download the web content:

```python
driver = webdriver.Chrome("/usr/bin/chromedriver")
driver.get('https://www.worlddata.info/average-income.php')
```

where `"/usr/bin/chromedriver"` is the web driver path.

Get the HTML content:

```python
soup = BeautifulSoup(driver.page_source, 'lxml')
```

Now you will get the `<tr>` tag object:

```python
general_list = soup.find("tr")
```
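Putting the steps together, a minimal end-to-end sketch might look like this (the chromedriver path is system-specific, so `/usr/bin/chromedriver` is just an assumption):

```python
from bs4 import BeautifulSoup
from selenium import webdriver

# Adjust this path to wherever chromedriver is installed on your system
driver = webdriver.Chrome("/usr/bin/chromedriver")
driver.get('https://www.worlddata.info/average-income.php')

# Parse the rendered page source, then release the browser
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

general_list = soup.find("tr")
if general_list is not None:
    print(general_list.get_text(strip=True))
```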
bharatk

It seems that `requests.get("https://www.worlddata.info/average-income.php")` gives 403 as a response, meaning that access to the web page is forbidden.

I did some googling and found this Stack Overflow post. It says that some web pages can reject GET requests that do not identify a User-Agent.

If you add a header to `requests.get` like so:

```python
header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get("https://www.worlddata.info/average-income.php", headers=header)
```

Then the response of the GET request will be 200, and your code should work as expected.
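For completeness, here is a sketch of the full scrape with the header applied. Note that it loops over `find_all("tr")` rather than iterating the single `<tr>` returned by `find("tr")`, and skips rows that lack two `<td>` cells (such as header rows); treat it as a starting point rather than a definitive implementation:

```python
from bs4 import BeautifulSoup
from pymongo import MongoClient
import requests

client = MongoClient("localhost", 27017)
collection = client.datadb.datacol

header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get("https://www.worlddata.info/average-income.php", headers=header)

soup = BeautifulSoup(page.content, 'html.parser')

# Loop over every table row, not just the first one
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) >= 2:  # header rows contain <th>, not <td>
        collection.insert_one({
            "country": cells[0].get_text(strip=True),
            "income": cells[1].get_text(strip=True),
        })
```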

funie200