
I would like to automatically save the city data from this website:

http://www.dataforcities.org/

I used the BeautifulSoup library to get data from this page:

http://open.dataforcities.org/details?4[]=2016

import urllib2
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://open.dataforcities.org/details?4[]=2016').read())

If I follow the example in Web scraping with Python, I get the following error:

soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())

for row in soup('table', {'class': 'metrics'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string

IndexError                                Traceback (most recent call last)
<ipython-input-71-d688ff354182> in <module>()
----> 1 for row in soup('table', {'class': 'metrics'})[0].tbody('tr'):
      2     tds = row('td')
      3     print tds[0].string, tds[1].string

IndexError: list index out of range

emax
  • Possible duplicate of [Web scraping with Python](https://stackoverflow.com/questions/2081586/web-scraping-with-python) – DJDaveMark Jan 18 '18 at 10:27

2 Answers


From a quick look at the site, a good technique here would be to look at the requests made by the JavaScript on the page. That reveals the internal API being used to fetch the data that populates the page.

For example, for a particular city, a GET request is made to http://open.dataforcities.org/city/109/themes/2017, which returns a JSON response containing many entries. You can fetch it yourself using requests:

>>> import requests
>>> response = requests.get('http://open.dataforcities.org/city/109/themes/2017')
>>> response.json()
[{'theme': 'Economy', 'score': 108, 'date': '2015', 'rank': '2/9'}, {'theme': 'Education', 'score': 97, 'date': '2015', 'rank': '8/9'}, {'theme': 'Energy', 'score': 110, 'date': '2015', 'rank': '1/9'}, 
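Since the endpoint returns a flat list of objects, it drops straight into a structured format. Here's a minimal sketch of saving it to CSV, assuming you have pandas installed and the response stays a list of flat dicts like the one above:

import requests
import pandas as pd

# Same endpoint as above: 109 is the city id, 2017 the year.
response = requests.get('http://open.dataforcities.org/city/109/themes/2017')

# A list of flat dicts maps directly onto DataFrame columns.
df = pd.DataFrame(response.json())
df.to_csv('city_109_themes_2017.csv', index=False)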

So, with a little work, you can likely discover all the endpoints you need to get the data you want. That's just one method. You could also use a browser automation tool like Selenium: not just for automating browser actions like scrolling and clicking, but also for executing arbitrary JavaScript and inspecting data from the page's JS.

from selenium import webdriver

# Requires a matching browser driver (e.g. chromedriver) on your PATH.
driver = webdriver.Chrome()
driver.get('https://example.com/page/to/scrape')

# Run arbitrary JavaScript in the page and return the result to Python.
value = driver.execute_script('return someThing.value;')

But before going to much trouble trying to scrape a site, you should always check whether it has a documented public API you can use.

sytech
  • Nice, where did you find the request example? – emax Jan 18 '18 at 10:36
  • Using my browser's dev tools Network tab. In most browsers, you bring this up with the `F12` key or by right-clicking anywhere and selecting the `inspect` (or similar) option. You should see a `Network` tab that will show you all the requests that are made between the browser and server. – sytech Jan 18 '18 at 10:39
  • Mozilla has a [good writeup for Firefox](https://developer.mozilla.org/en-US/docs/Tools/Network_Monitor) – sytech Jan 18 '18 at 10:41
  • Great, I accepted your answer. I found that the data I want is in `html`. For instance, for Boston the data is at `http://open.dataforcities.org/details?23[]=2014`. How can I save this data in a structured format such as a dataframe? – emax Jan 18 '18 at 11:20

You can scrape data from a website using Python; the BeautifulSoup library helps parse the HTML and extract what you need. There are other libraries too, and even Node.js can do the same thing.

The main thing is your logic. Python and BeautifulSoup will give you the data; you have to analyse it and save it in a database (see the sketch after the example below).

Beautiful Soup Documentation

Others: Requests, lxml, Selenium, Scrapy

Example

from bs4 import BeautifulSoup
import requests

page = requests.get("http://www.dataforcities.org/")
soup = BeautifulSoup(page.content, 'html.parser')

# Collect every anchor tag on the page.
all_links = soup.find_all("a")
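And to cover the "analyse and save it in a database" part, here's a minimal sketch using Python's built-in sqlite3 module; the table name and schema are only for illustration:

import sqlite3

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.dataforcities.org/")
soup = BeautifulSoup(page.content, 'html.parser')

# Illustrative schema: one row per link, storing its text and target URL.
conn = sqlite3.connect('scraped.db')
conn.execute('CREATE TABLE IF NOT EXISTS links (text TEXT, href TEXT)')
for a in soup.find_all('a'):
    conn.execute('INSERT INTO links VALUES (?, ?)',
                 (a.get_text(strip=True), a.get('href')))
conn.commit()
conn.close()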

As above, you can find just about anything; there are many such functions. Tutorials: web scraping tutorial, python and beautifulsoup.

It's better to check the official documentation as well.

Sampath
  • Nice, but I cannot understand how it works. I know that I have to clean the data afterwards, but can you give an example? – emax Jan 18 '18 at 10:39
  • I updated the answer with an example. Check the tutorials as well. – Sampath Jan 19 '18 at 02:35