I have a file like this:

<table>
<span class="city"> Miami </span> <span><a href="miami" > Miami </a> </span>
<span class="city"> Orlando </span> <span><a href="orlando" > orlando </a></span>
</table>
<table>
<span class="city"> Los Angeles </span> <span><a href="Los Angeles" > </a> </span>
<span class="city"> San Diego </span>  <span><a href="Los Angeles" > San Diego</a> </span>
</table>

How can I extend the regex re.compile('city">([^<]+)</span>') so that cities belonging to the same state (table) are grouped together, with a new group starting whenever a table ends (without a while loop)? For example:

State 1: Miami, Orlando
State 2: Los Angeles, San Diego
Emmet B
  • Obligatory link: http://stackoverflow.com/a/1732454/1561176 – Inbar Rose Jan 30 '13 at 08:26
  • Why are you using regex for this? "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." (Jamie Zawinski) Just use lxml with HTML; there are plenty of tutorials on the web for this kind of thing and you won't go crazy ^^ – Ketouem Jan 30 '13 at 08:28
  • @Inbar Well, that took me a while, and I figured it out myself. I am able to get the cities, but I was wondering if there is a neat way of categorizing them. – Emmet B Jan 30 '13 at 08:31
  • If you really want to use regexes, then use two regexes: one to find the tables, then one to search for the cities within the text inside each pair of table tags. – Justin Peel Jan 30 '13 at 08:32
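
The two-regex approach suggested in the comment above could look roughly like the sketch below. It is only an illustration: the `HTML` sample is modeled on the question's input, and the names `TABLE_RE`, `CITY_RE` and `group_cities` are made up for this example. It also assumes the tables are never nested and that the opening tag is written exactly as `<table>`, which is the kind of fragility that makes a real HTML parser the safer choice.

```python
import re

# Sample input modeled on the question (hypothetical data for illustration).
HTML = """\
<table>
<span class="city"> Miami </span> <span><a href="miami" > Miami </a> </span>
<span class="city"> Orlando </span> <span><a href="orlando" > orlando </a></span>
</table>
<table>
<span class="city"> Los Angeles </span> <span><a href="Los Angeles" > </a> </span>
<span class="city"> San Diego </span>  <span><a href="Los Angeles" > San Diego</a> </span>
</table>
"""

# Outer regex: the text between <table> and </table>.
# Non-greedy, and DOTALL so that "." also matches newlines.
TABLE_RE = re.compile(r"<table>(.*?)</table>", re.DOTALL)

# Inner regex: the pattern from the question.
CITY_RE = re.compile(r'city">([^<]+)</span>')


def group_cities(html):
    """Return one list of stripped city names per table, in document order."""
    return [
        [name.strip() for name in CITY_RE.findall(body)]
        for body in TABLE_RE.findall(html)
    ]


groups = group_cities(HTML)
for number, cities in enumerate(groups, start=1):
    print("State %d: %s" % (number, ", ".join(cities)))
# State 1: Miami, Orlando
# State 2: Los Angeles, San Diego
```

Both passes use `findall`, so no explicit while loop is needed; the grouping comes from running the city pattern once per table body rather than once over the whole file.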

1 Answer

Use a proper HTML parser:

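from bs4 import BeautifulSoup

# Parse the file with the standard-library parser backend
soup = BeautifulSoup(open(...).read(), "html.parser")
states = {}
# Calling a soup or tag object is shorthand for find_all()
for i, table in enumerate(soup("table")):
    for city in table("span"):
        # Collect the stripped text of each span under its table's index
        states.setdefault(i, []).append(city.text.strip())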

which will give

states
{0: [u'Miami', u'Orlando'], 1: [u'Los Angeles', u'San Diego']}
Katriel
  • Thanks. I was going to accept this answer, but I saw that for some tables I am getting duplicates, as some rows have multiple `<span>`s. I updated the table structure. – Emmet B Jan 30 '13 at 08:45
  • Use `table("span", "city")` to search for only the `span` tags that have class `city`. You should read the docs for BeautifulSoup. – Katriel Jan 30 '13 at 08:59
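
Putting the class filter from the last comment together with the answer gives something like the following sketch. The `HTML` sample and the function name `cities_by_table` are assumptions made for illustration. `find_all("span", class_="city")` is the spelled-out form of `table("span", "city")`. The sketch picks `"html.parser"` explicitly because the sample puts `<span>` elements directly inside `<table>`, and other parser backends may restructure such markup differently.

```python
from bs4 import BeautifulSoup

# Sample input modeled on the question (hypothetical data for illustration).
HTML = """\
<table>
<span class="city"> Miami </span> <span><a href="miami" > Miami </a> </span>
<span class="city"> Orlando </span> <span><a href="orlando" > orlando </a></span>
</table>
<table>
<span class="city"> Los Angeles </span> <span><a href="Los Angeles" > </a> </span>
<span class="city"> San Diego </span>  <span><a href="Los Angeles" > San Diego</a> </span>
</table>
"""


def cities_by_table(html):
    """Map each table's index to the text of its class="city" spans."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        index: [
            span.get_text(strip=True)
            # Only spans carrying class "city"; the link spans are skipped.
            for span in table.find_all("span", class_="city")
        ]
        for index, table in enumerate(soup.find_all("table"))
    }


states = cities_by_table(HTML)
print(states)
# {0: ['Miami', 'Orlando'], 1: ['Los Angeles', 'San Diego']}
```

Filtering on the class keeps the second `<span>` in each row (the one wrapping the link) out of the result, which is what produced the duplicates mentioned above.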