I feel kind of stupid asking this but I have made a few regular expressions to find specific businesses, addresses, and URLs in an HTML document. The problem is...I don't know which (python) regular expression thing I should use. When I use re.findall, I get 30 to 90 results. I want to limit it to 3 or maybe 5 (one set number). Which regex operation should I use to do this, or is there a parameter that can stop the search when it has reached a certain number of results?

Also, is there a faster way of searching an HTML document so that my program isn't slowed down with regular expressions searching this really long "string" of text?

Thanks.

EDIT

I have Beautiful Soup and I've used it to just make things easier to read...but not to parse.

I've also used lxml...which is better/faster?

Marcus Johnson
  • my bad for posting an answer, you [shouldn't parse HTML with regex](http://stackoverflow.com/a/1732454/1219006), use a parser – jamylak Aug 10 '12 at 13:14
  • What about fetching the HTML page and reading it line by line with a regexp? If I understand you correctly, you read the whole HTML page at once? Correct me if I'm wrong. I can describe how I parse the page with regexps if you need. – Ishikawa Yoshi Aug 10 '12 at 13:26

1 Answer

Instead of using re.findall, use re.finditer. It returns an iterator that yields each match on demand, so you can stop after a fixed number of matches without scanning the rest of the string.

Here's an example:

>>> import re
>>> [m.group(0) for m, _ in zip(re.finditer(r"\w", "abcdef"), range(3))]
['a', 'b', 'c']
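
The same idea can also be written with itertools.islice, which may read a little more clearly than the zip/range trick. This is only a sketch: the helper name `first_matches`, the sample `html` string, and the URL pattern are made up for illustration and are not from the question.

```python
import re
from itertools import islice


def first_matches(pattern, text, limit):
    """Return at most `limit` matches of `pattern` in `text` (hypothetical helper)."""
    # re.finditer yields match objects lazily; islice stops asking for more
    # after `limit` items, so the remainder of the text is never searched.
    return [m.group(0) for m in islice(re.finditer(pattern, text), limit)]


# Made-up sample input, standing in for the long HTML string in the question.
html = (
    '<a href="http://a.example">A</a> '
    '<a href="http://b.example">B</a> '
    '<a href="http://c.example">C</a>'
)

# Assumed, deliberately simple URL pattern; a real one would likely differ.
print(first_matches(r'https?://[^\s"<>]+', html, 2))
```

If fewer than `limit` matches exist, the list is simply shorter; no exception is raised.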
MRAB