
I am writing a crawler to extract certain parts of an HTML file, but I cannot figure out how to use re.findall().

Here is an example: when I want to find every <div>...</div> part in the file, I may write something like this:

re.findall("<div>.*</div>", result_page)

If result_page is the string "<div> </div> <div> </div>", the result will be

['<div> </div> <div> </div>']

Only the entire string is matched. This is not what I want; I expect the two divs separately. What should I do?

alvinzoo

2 Answers


Quoting the documentation,

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.

Just add the question mark:

In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']
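Note that the non-greedy match still goes wrong once divs are nested, which is one reason to reach for a real parser. A small sketch with a hypothetical nested sample string:

```python
import re

nested = "<div>outer <div>inner</div></div>"
# The non-greedy match stops at the FIRST closing tag it finds,
# so the outer div is cut short and its trailing </div> is orphaned.
print(re.findall("<div>.*?</div>", nested))
# → ['<div>outer <div>inner</div>']
```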

Also, you shouldn't use RegEx to parse HTML, since there are HTML parsers made exactly for that purpose. Example using BeautifulSoup 4:

In [7]: import bs4

In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page)('div')]
Out[8]: ['<div> </div>', '<div> </div>']
vaultah
  • Why shouldn't I use RegEx to parse HTML? What is the correct way? – alvinzoo Apr 26 '15 at 04:43
  • @alvinzoo there are always HTML parsers, e.g. Beautiful Soup for Python. You might want to read [this famous question](http://stackoverflow.com/q/1732348/2301450). – vaultah Apr 26 '15 at 04:45

* is a greedy operator; you want to use *? for a non-greedy match.

re.findall("<div>.*?</div>", result_page)

Or use a parser such as BeautifulSoup instead of regular expressions for this task:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
soup.find_all('div')
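Put together as a runnable sketch using the sample string from the question (passing "html.parser" is optional, but it pins the parser and silences bs4's parser warning):

```python
from bs4 import BeautifulSoup

result_page = "<div> </div> <div> </div>"
soup = BeautifulSoup(result_page, "html.parser")
# find_all returns Tag objects; str() gives back the markup
divs = [str(tag) for tag in soup.find_all("div")]
print(divs)
# → ['<div> </div>', '<div> </div>']
```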
hwnd