You can use BeautifulSoup 4 (bs4) to parse and search the HTML.
Have a look at this example:
In [4]: from urllib.request import urlopen  # Python 2: from urllib2 import urlopen
In [5]: citylinkpage = urlopen("http://www.city-data.com/city/Texas.html")
In [6]: from bs4 import BeautifulSoup as BS
In [7]: soup = BS(citylinkpage, "html.parser")  # name a parser explicitly to avoid a bs4 warning
In [8]: allImportantLinks = soup.select('table.cityTAB td.ph a')
In [9]: print(allImportantLinks[:10])
[<a href='javascript:l("Abbott");'>Abbott</a>, <a href='javascript:l("Abernathy");'>Abernathy</a>, <a href="Abilene-Texas.html">Abilene, TX</a>, <a href="Addison-Texas.html">Addison, TX</a>, <a href="Alamo-Heights-Texas.html">Alamo Heights</a>, <a href='javascript:l("Albany");'>Albany, TX</a>, <a href="Alice-Texas.html">Alice</a>, <a href="Allen-Texas.html">Allen, TX</a>, <a href='javascript:l("Alma");'>Alma, TX</a>, <a href="Alpine-Texas.html">Alpine, TX</a>]
In [14]: allCityUrls = ["http://www.city-data.com/city/"+a.get('href') for a in allImportantLinks if not a.get('href').startswith('javascript:')]
In [15]: allCityUrls
Out[15]:
['http://www.city-data.com/city/Abilene-Texas.html',
'http://www.city-data.com/city/Addison-Texas.html',
'http://www.city-data.com/city/Alamo-Heights-Texas.html',
'http://www.city-data.com/city/Alice-Texas.html',
'http://www.city-data.com/city/Allen-Texas.html',
'http://www.city-data.com/city/Alpine-Texas.html',
'http://www.city-data.com/city/Amarillo-Texas.html',
...
Because each city's page appears to be malformed HTML (at least around this index), it is easier to search the raw page source with regular expressions (using the built-in re module):
cityPageAddress = "http://www.city-data.com/city/Abilene-Texas.html"
pageSourceCode = urlopen(cityPageAddress).read().decode("utf-8", errors="replace")

import re
expr = re.compile(r"cost of living index in .*?:</b>\s*(\d+(\.\d+)?)\s*<b>")
print(expr.findall(pageSourceCode)[0][0])
# prints: 83.5
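Putting the two steps together, you could loop over all the city URLs and collect the index for each one. This is only a sketch: the regex assumes the page layout shown above, and city-data.com may have changed since. The parsing is kept in its own function so it can be tried out on a plain HTML string without hitting the network.

```python
import re
from urllib.request import urlopen

INDEX_RE = re.compile(r"cost of living index in .*?:</b>\s*(\d+(\.\d+)?)\s*<b>")

def extract_index(html):
    """Return the cost-of-living index found in an HTML string, or None."""
    match = INDEX_RE.search(html)
    return float(match.group(1)) if match else None

def cost_of_living(url):
    """Fetch a city page and extract its index (requires network access)."""
    html = urlopen(url).read().decode("utf-8", errors="replace")
    return extract_index(html)

# Hypothetical use with the allCityUrls list built earlier:
# indices = {url: cost_of_living(url) for url in allCityUrls}
```

Separating fetching from parsing also makes it easy to add polite delays between requests when crawling many city pages.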