I am attempting to gather data on the cost of living index for all the cities in Texas from the website below: http://www.city-data.com/city/Texas.html

What would be the easiest way to scrape the data from the webpage? I have tried a Chrome extension called Web Scraper, but was not successful. I am thinking it might work better in R with the XML package, or with Scrapy. I looked into both approaches but am somewhat lost, and was looking for some direction. Any input would be helpful.

– stochasticcrap

2 Answers

You can use BeautifulSoup4 (bs4) to parse the HTML. Have a look at this example:

In [4]: from urllib2 import urlopen

In [5]: citylinkpage = urlopen("http://www.city-data.com/city/Texas.html")

In [7]: from bs4 import BeautifulSoup as BS

In [8]: soup = BS(citylinkpage)

In [9]: allImportantLinks = soup.select('table.cityTAB td.ph a')

In [10]: print allImportantLinks[:10]
[<a href='javascript:l("Abbott");'>Abbott</a>, <a href='javascript:l("Abernathy");'>Abernathy</a>, <a href="Abilene-Texas.html">Abilene, TX</a>, <a href="Addison-Texas.html">Addison, TX</a>, <a href="Alamo-Heights-Texas.html">Alamo Heights</a>, <a href='javascript:l("Albany");'>Albany, TX</a>, <a href="Alice-Texas.html">Alice</a>, <a href="Allen-Texas.html">Allen, TX</a>, <a href='javascript:l("Alma");'>Alma, TX</a>, <a href="Alpine-Texas.html">Alpine, TX</a>]

In [14]: allCityUrls = ["http://www.city-data.com/city/"+a.get('href') for a in allImportantLinks if not a.get('href').startswith('javascript:')]

In [15]: allCityUrls
Out[15]: 
['http://www.city-data.com/city/Abilene-Texas.html',
 'http://www.city-data.com/city/Addison-Texas.html',
 'http://www.city-data.com/city/Alamo-Heights-Texas.html',
 'http://www.city-data.com/city/Alice-Texas.html',
 'http://www.city-data.com/city/Allen-Texas.html',
 'http://www.city-data.com/city/Alpine-Texas.html',
 'http://www.city-data.com/city/Amarillo-Texas.html',
...

Because the page for each city seems to be badly formed HTML (especially around this index), it seems better to search the page source with regular expressions (using the built-in re module):

import re

cityPageAddress = "http://www.city-data.com/city/Abilene-Texas.html"
pageSourceCode = urlopen(cityPageAddress).read()

# capture the number right after "cost of living index in <city>:</b>"
expr = re.compile(r"cost of living index in .*?:</b>\s*(\d+(\.\d+)?)\s*<b>")
print expr.findall(pageSourceCode)[0][0]
# Out: 83.5
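
Putting the pieces together, a minimal sketch that loops over the allCityUrls list and reuses the expr regex compiled above (the one-second delay and the URLError handling are additions to be gentle on the server; the crashes mentioned in the comments below may simply be the site rejecting rapid-fire requests):

import time
from urllib2 import urlopen, URLError

costOfLiving = {}
for cityUrl in allCityUrls:          # list built in the session above
    try:
        source = urlopen(cityUrl).read()
    except URLError:
        continue                     # skip pages that fail to load
    matches = expr.findall(source)   # regex compiled above
    if matches:                      # not every page may list the index
        costOfLiving[cityUrl] = float(matches[0][0])
    time.sleep(1)                    # be polite: at most one request per second

print costOfLiving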
– koffein
  • What does the #pageSourceCode = ... refer to in the second bit of the code? I was not able to run that chunk of code. – stochasticcrap Jan 30 '14 at 05:12
  • Is there a way to prevent the website from crashing when I run the code? I tried running the code with the url list partitioned in sets of 400, but it still crashes. Any ideas? error: urllib2.URLError: – stochasticcrap Jan 30 '14 at 22:00
  • @user11235813: http://stackoverflow.com/questions/5620263/using-an-http-proxy-python – koffein Feb 01 '14 at 18:32
  • Thanks for the help. I have one more question. Is it possible to extract the county name of the city using a regular expression? – stochasticcrap Feb 02 '14 at 22:00
  • This looks like a case for beautifulsoup4: I look for every link that starts with "/county", take the first one and take its text: `soup.select('a[href^=/county]')[0].getText()` (see the sketch after these comments) – koffein Feb 03 '14 at 01:18
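
A minimal sketch of that county lookup in context (assuming the first /county link on a city page names the city's county; note the quoted attribute value, which newer bs4 versions require in CSS selectors):

from urllib2 import urlopen
from bs4 import BeautifulSoup as BS

citySoup = BS(urlopen("http://www.city-data.com/city/Abilene-Texas.html"))

# the first link whose href starts with "/county" should point to the county page
countyLinks = citySoup.select('a[href^="/county"]')
if countyLinks:
    print countyLinks[0].getText()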

Give Scrapy a try. Check out my blog post on recursive scraping.
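
For reference, a minimal Scrapy spider for this task might look like the sketch below. It reuses the CSS selector and the regex from the other answer, and assumes a current Scrapy version (response.follow and .getall() are newer APIs):

import re
import scrapy

class TexasCostOfLivingSpider(scrapy.Spider):
    name = "texas_cost_of_living"
    start_urls = ["http://www.city-data.com/city/Texas.html"]

    # same pattern as in the other answer
    index_re = re.compile(r"cost of living index in .*?:</b>\s*(\d+(\.\d+)?)\s*<b>")

    def parse(self, response):
        # follow every real city link; skip the javascript: pseudo-links
        for href in response.css("table.cityTAB td.ph a::attr(href)").getall():
            if not href.startswith("javascript:"):
                yield response.follow(href, callback=self.parse_city)

    def parse_city(self, response):
        match = self.index_re.search(response.text)
        if match:
            yield {"city_url": response.url,
                   "cost_of_living_index": float(match.group(1))}

Something like scrapy runspider texas_spider.py -o texas_col.csv would then write the results to a CSV file (the file names here are made up).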

– Philip Adzanoukpe