
I am writing a crawler to extract certain parts of an HTML file, but I cannot figure out how to use re.findall().

Here is an example: when I want to find every <div>...</div> part in the file, I may write something like this:

re.findall("<div>.*</div>", result_page)

If result_page is the string "<div> </div> <div> </div>", the result will be

['<div> </div> <div> </div>']

Only the entire string is matched. This is not what I want; I expect the two divs separately. What should I do?

alvinzoo

2 Answers


Quoting the documentation,

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.

Just add the question mark:

In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']
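Note that the non-greedy match still goes wrong once divs are nested, which is one reason to reach for a real parser. A small sketch with a hypothetical nested sample string:

```python
import re

nested = "<div>outer <div>inner</div></div>"
# The non-greedy match stops at the FIRST closing tag it finds,
# so the outer div is cut short and its trailing </div> is orphaned.
print(re.findall("<div>.*?</div>", nested))
# → ['<div>outer <div>inner</div>']
```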

Also, you shouldn't use RegEx to parse HTML, since there are HTML parsers made exactly for that purpose. Example using BeautifulSoup 4:

In [7]: import bs4

In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page)('div')]
Out[8]: ['<div> </div>', '<div> </div>']
vaultah
  • Why shouldn't I use RegEx to parse HTML? What is the correct way? – alvinzoo Apr 26 '15 at 04:43
  • @alvinzoo there are always HTML parsers, e.g. Beautiful Soup for Python. You might want to read [this famous question](http://stackoverflow.com/q/1732348/2301450). – vaultah Apr 26 '15 at 04:45

* is a greedy operator; you want to use *? for a non-greedy match.

re.findall("<div>.*?</div>", result_page)

Or use a parser such as BeautifulSoup instead of regular expressions for this task:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
soup.find_all('div')
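Put together as a runnable sketch using the sample string from the question (passing "html.parser" is optional, but it pins the parser and silences bs4's parser warning):

```python
from bs4 import BeautifulSoup

result_page = "<div> </div> <div> </div>"
soup = BeautifulSoup(result_page, "html.parser")
# find_all returns Tag objects; str() gives back the markup
divs = [str(tag) for tag in soup.find_all("div")]
print(divs)
# → ['<div> </div>', '<div> </div>']
```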
hwnd