
I'm new to regex, so I hope this isn't too obvious a question.

I'm looking for the neighborhood in a Craigslist apartment listing's HTML. The neighborhood is listed like this:

(castro / upper market)
</h2>

And here is an example of the HTML:

<a class="backup" disabled="disabled">&#9650;</a>
<a class="next" disabled="disabled"> next &#9654;</a>
</span>

</section>

<h2 class="postingtitle">
<span class="star"></span>
&#x0024;5224 / 2br - Stunning Furnished 2BR with Hardwwod Floors &amp; Newly  renovated Kitchen (pacific heights)
</h2>
<section class="userbody">
<figure class="iw">


<div class="slidernav">
    <button class="sliderback">&lt;</button>
    <span class="sliderinfo"></span>
    <button class="sliderforward">&gt;</button>

This should find all the different neighborhoods, but it takes way too long on a full page of HTML:

\w+\s?(\/)?\s?\w+\s?(\/)?\s?\w+\s?(\/)?\s?\w+\)\n<\/h2>

# \w+ to find the word 
# \s?(\/)?\s? for a space or space, forward slash, space
# \n<\/h2> because </h2> is uniquely next to the neighborhood in the html

Is there a way to find

</h2>

and then look behind for the neighborhood string of text?

Thanks so much for any help or for steering me in the right direction.

David Feldman
  • Using regex for HTML is not a very good idea ([more here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)). Use a proper tool like http://scrapy.org/, for example. – Marcin Jan 16 '15 at 23:14

3 Answers

2

Use an HTML parser to extract the title (the h2 tag contents) and then use a regular expression to extract the neighborhood (the text inside the parentheses).

Example (using the BeautifulSoup HTML parser):

import re
from bs4 import BeautifulSoup
import requests

response = requests.get('http://sfbay.craigslist.org/sfc/apa/4849806764.html')
soup = BeautifulSoup(response.content, 'html.parser')  # pick an explicit parser to avoid the "no parser specified" warning

pattern = re.compile(r'\((.*?)\)$')
text = soup.find('h2', class_='postingtitle').text.strip()
print(pattern.search(text).group(1))

Prints pacific heights.

Note the \((.*?)\)$ regular expression: it captures everything inside the parentheses that sit directly before the end of the string.
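For instance, here is a minimal standalone check of that pattern (the title string below is just the posting title from the question, shortened):

import re

pattern = re.compile(r'\((.*?)\)$')
title = '$5224 / 2br - Stunning Furnished 2BR (pacific heights)'
match = pattern.search(title)
if match:
    print(match.group(1))  # -> pacific heights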


With the Scrapy web-scraping framework you can solve it in one line, since Selectors have built-in support for regular expressions. Example from the Scrapy shell:

$ scrapy shell http://sfbay.craigslist.org/sfc/apa/4849806764.html
In [1]: response.xpath('//h2[@class="postingtitle"]/text()').re(r'\((.*?)\)$')[0]
Out[1]: u'pacific heights'

Also see the hundred reasons why regex should not be used for HTML parsing, for example [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags).

alecxe
  • you're probably right, it's currently taking 6 seconds to go through about 5000 listings, pulling out about 20 features so far. i'll take a closer look at this when i have time to rework all of it – David Feldman Jan 16 '15 at 23:31
  • @DavidFeldman sure, start looking into scrapy and get your code organized and modular by having a scrapy project with spiders, items and pipelines. – alecxe Jan 16 '15 at 23:32
  • Actually, *parsing* HTML and *extracting* content from a (web-) page are not exactly the same thing. While you should not *parse* HTML with regular expressions, for this particular case, extraction with a well-crafted RE is probably more than an *order of magnitude* faster, I would dare betting. – fnl Jan 16 '15 at 23:53
  • @fnl this is just if we are talking about the speed of extracting the text from a single page. What about readability, complexity, reliability etc? There are specific formats and specialized tools that are made specifically to parse these formats, tested and used by an enormous amount of users, proven to work. – alecxe Jan 17 '15 at 00:12
  • @fnl I wanted to say much more than that, but I've calmed down :) This is your opinion, let's not argue here. – alecxe Jan 17 '15 at 00:14
  • @alecxe Hehe, sure. I appreciate your sudden Husserlian take on the matter :) – fnl Jan 17 '15 at 00:34
1

What about using str.find to locate the index of </h2> and then slicing back from that index?

 In [1]: import re

 In [2]: c = "123456</h2>7890"

 In [3]: x = c.find("</h2>")

 In [4]: print c[x-6:x]
 123456
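Slicing back a fixed 6 characters only works for this toy string; for the posting HTML you could instead walk back from </h2> to the nearest opening parenthesis. A rough sketch of that idea (the helper name is made up, and it assumes the neighborhood sits in parentheses right before </h2>, as in the question):

# Hypothetical helper: slice backwards from </h2> to the nearest "(".
def neighborhood_before_h2(page):
    end = page.find('</h2>')           # position of the closing tag
    if end == -1:
        return None
    start = page.rfind('(', 0, end)    # nearest "(" before </h2>
    if start == -1:
        return None
    return page[start + 1:end].rstrip(')\n ')  # drop the ")" and trailing whitespace

# neighborhood_before_h2('... (pacific heights)\n</h2>')  # -> 'pacific heights'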
cmidi
0

Assuming your HTML is stored in a variable called page, how about this pattern?

re.findall("\(([^\(\)]+)\)\n<\/h2>", page)

For good measure, allow for extra spaces, too:

re.findall("\(([^\(\)]+)\)\s*\n\s*<\/h2>", page)

Finally, precompile the automaton:

neighborhoods = re.compile(r"\(([^\(\)]+)\)\s*\n\s*<\/h2>")

# somewhere else, for each page 
for nh in neighborhoods.findall(page):
    print(nh)

For your example HTML page, this prints the only neighborhood in there:

pacific heights

If you only have one location per page, using re.search() for it would be even faster. Just remember that search() produces a match object, not the string itself.
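A quick sketch of that variant, reusing the compiled pattern from above:

match = neighborhoods.search(page)
if match:
    print(match.group(1))   # e.g. pacific heights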

fnl