I am attempting to gather data on the cost of living index for all the cities in Texas from the website below: http://www.city-data.com/city/Texas.html

What would be the easiest way to scrape the data from the webpage? I have tried a Chrome extension called Web Scraper, but was not successful. I am thinking it might work better in R with the XML package, or with Scrapy. I looked into both approaches but am somewhat lost, and was looking for some direction. Any input would be helpful.

– stochasticcrap

2 Answers

You can use BeautifulSoup4 (bs4) to parse the HTML. Have a look at this example:

In [4]: from urllib2 import urlopen

In [5]: citylinkpage = urlopen("http://www.city-data.com/city/Texas.html")

In [7]: from bs4 import BeautifulSoup as BS

In [8]: soup = BS(citylinkpage)

In [9]: allImportantLinks = soup.select('table.cityTAB td.ph a')

In [10]: print allImportantLinks[:10]
[<a href='javascript:l("Abbott");'>Abbott</a>, <a href='javascript:l("Abernathy");'>Abernathy</a>, <a href="Abilene-Texas.html">Abilene, TX</a>, <a href="Addison-Texas.html">Addison, TX</a>, <a href="Alamo-Heights-Texas.html">Alamo Heights</a>, <a href='javascript:l("Albany");'>Albany, TX</a>, <a href="Alice-Texas.html">Alice</a>, <a href="Allen-Texas.html">Allen, TX</a>, <a href='javascript:l("Alma");'>Alma, TX</a>, <a href="Alpine-Texas.html">Alpine, TX</a>]

In [14]: allCityUrls = ["http://www.city-data.com/city/"+a.get('href') for a in allImportantLinks if not a.get('href').startswith('javascript:')]

In [15]: allCityUrls
Out[15]: 
['http://www.city-data.com/city/Abilene-Texas.html',
 'http://www.city-data.com/city/Addison-Texas.html',
 'http://www.city-data.com/city/Alamo-Heights-Texas.html',
 'http://www.city-data.com/city/Alice-Texas.html',
 'http://www.city-data.com/city/Allen-Texas.html',
 'http://www.city-data.com/city/Alpine-Texas.html',
 'http://www.city-data.com/city/Amarillo-Texas.html',
...

Because the page for each city seems to be badly formed HTML (especially around this index), it seems better to search the page source with regular expressions (using the built-in re module):

import re

cityPageAddress = "http://www.city-data.com/city/Abilene-Texas.html"
pageSourceCode = urlopen(cityPageAddress).read()

# capture the number right after "cost of living index in <city>:</b>"
expr = re.compile(r"cost of living index in .*?:</b>\s*(\d+(\.\d+)?)\s*<b>")
print expr.findall(pageSourceCode)[0][0]
# Out: 83.5
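
Putting the pieces together, a minimal sketch that loops over the allCityUrls list and reuses the expr regex compiled above (the one-second delay and the URLError handling are additions to be gentle on the server; the crashes mentioned in the comments below may simply be the site rejecting rapid-fire requests):

import time
from urllib2 import urlopen, URLError

costOfLiving = {}
for cityUrl in allCityUrls:          # list built in the session above
    try:
        source = urlopen(cityUrl).read()
    except URLError:
        continue                     # skip pages that fail to load
    matches = expr.findall(source)   # regex compiled above
    if matches:                      # not every page may list the index
        costOfLiving[cityUrl] = float(matches[0][0])
    time.sleep(1)                    # be polite: at most one request per second

print costOfLiving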
– koffein
  • What does the #pageSourceCode = ... refer to in the second bit of the code? I was not able to run that chunk of code. – stochasticcrap Jan 30 '14 at 05:12
  • Is there a way to prevent the website from crashing when I run the code? I tried running the code with the url list partitioned in sets of 400, but it still crashes. Any ideas? error: urllib2.URLError: – stochasticcrap Jan 30 '14 at 22:00
  • @user11235813: http://stackoverflow.com/questions/5620263/using-an-http-proxy-python – koffein Feb 01 '14 at 18:32
  • Thanks for the help. I have one more question. Is it possible to extract the county name of the city using a regular expression? – stochasticcrap Feb 02 '14 at 22:00
  • This looks like a case for beautifulsoup4: I look for every link that starts with "/county", take the first one and take its text: `soup.select('a[href^=/county]')[0].getText()` (see the sketch after these comments) – koffein Feb 03 '14 at 01:18
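
A minimal sketch of that county lookup in context (assuming the first /county link on a city page names the city's county; note the quoted attribute value, which newer bs4 versions require in CSS selectors):

from urllib2 import urlopen
from bs4 import BeautifulSoup as BS

citySoup = BS(urlopen("http://www.city-data.com/city/Abilene-Texas.html"))

# the first link whose href starts with "/county" should point to the county page
countyLinks = citySoup.select('a[href^="/county"]')
if countyLinks:
    print countyLinks[0].getText()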

Give Scrapy a try. Check out my blog post on recursive scraping.
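
For reference, a minimal Scrapy spider for this task might look like the sketch below. It reuses the CSS selector and the regex from the other answer, and assumes a current Scrapy version (response.follow and .getall() are newer APIs):

import re
import scrapy

class TexasCostOfLivingSpider(scrapy.Spider):
    name = "texas_cost_of_living"
    start_urls = ["http://www.city-data.com/city/Texas.html"]

    # same pattern as in the other answer
    index_re = re.compile(r"cost of living index in .*?:</b>\s*(\d+(\.\d+)?)\s*<b>")

    def parse(self, response):
        # follow every real city link; skip the javascript: pseudo-links
        for href in response.css("table.cityTAB td.ph a::attr(href)").getall():
            if not href.startswith("javascript:"):
                yield response.follow(href, callback=self.parse_city)

    def parse_city(self, response):
        match = self.index_re.search(response.text)
        if match:
            yield {"city_url": response.url,
                   "cost_of_living_index": float(match.group(1))}

Something like scrapy runspider texas_spider.py -o texas_col.csv would then write the results to a CSV file (the file names here are made up).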

– Philip Adzanoukpe