
I'm trying to extract paragraphs from HTML using the following line of code:

paragraphs = re.match(r'<p>.{1,}</p>', html)

but it returns None even though I know the HTML contains paragraphs. Why?

Curious
    Argh!!! [Stop using regular expressions to parse HTML](http://stackoverflow.com/a/1732454/364696). We have proper HTML and XML parsers for this. For HTML, use [`HTMLParser`](https://docs.python.org/2/library/htmlparser.html) (Py2) or [`html.parser`](https://docs.python.org/3/library/html.parser.html) (Py3) or a third party package like BeautifulSoup. If it's true XHTML, you can use a more strict parser that can parse iteratively, e.g. [`xml.etree.ElementTree.iterparse`](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse). DON'T USE REGEXES! – ShadowRanger Dec 29 '15 at 01:46
    ...Horrible use of regex aside, isn't `.{1,}` just "any character one or more times"? Why on Earth wouldn't you use `.+` for that? – jpmc26 Dec 29 '15 at 01:52

3 Answers


Why not use an HTML parser to, well, parse HTML? Example using BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
...     <div>
...         <p>text1</p>
...         <p></p>
...         <p>text2</p>
...     </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']

Note that `text=True` helps to filter out empty paragraphs.

alecxe
  • Note that extracting paragraphs with BeautifulSoup can be about two orders of magnitude slower than a regexp-based solution. This is something to keep in mind if performance is important for your application. – jamix Apr 13 '21 at 15:27

Make sure you use re.search (or re.findall) instead of re.match, which only tries to match at the beginning of the string (and your html almost certainly does not start with a <p> tag).

You should also note that your pattern is currently greedy, meaning it will match everything between the first <p> tag and the last </p>, which is something you definitely do not want. Try

re.findall(r'<p(\s.*?)?>(.*?)</p>', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)

instead. The question mark makes the `.*?` quantifier non-greedy, so each match stops at the first closing </p> tag, and findall returns every match rather than just the first.
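For example (a quick interpreter sketch; I'm assuming `html` holds the markup from the question, and because the pattern has two capturing groups, findall returns a tuple per match with the paragraph text in the second element):

>>> import re
>>> html = "<div><p>text1</p><p></p><p>text2</p></div>"
>>> re.findall(r'<p(\s.*?)?>(.*?)</p>', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
[('', 'text1'), ('', ''), ('', 'text2')]

Empty paragraphs still show up here, so filter them out afterwards if you don't want them.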

Martin Konecny

You should be using re.search instead of re.match. The former will search the entire string whereas the latter will only match if the pattern is at the beginning of the string.
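A quick illustration with the pattern from the question (a minimal sketch; the sample markup here is made up):

>>> import re
>>> html = "<html><body><p>hello</p></body></html>"
>>> print(re.match(r'<p>.{1,}</p>', html))  # anchored at the start of the string
None
>>> re.search(r'<p>.{1,}</p>', html).group()  # scans the whole string
'<p>hello</p>'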

That said, regular expressions are a horrible tool for parsing HTML. You will hit a wall with them very shortly. I strongly recommend you look at HTMLParser or BeautifulSoup for your task.
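If you want to stay in the standard library, here is a minimal sketch using Python 3's html.parser (the module is called HTMLParser on Python 2); the ParagraphExtractor name is just my own for the example:

from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    # Collects the text content of every <p> element it sees.
    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            # Accumulate text so "<p>a<b>c</b></p>" still yields "ac".
            self.paragraphs[-1] += data

parser = ParagraphExtractor()
parser.feed("<div><p>text1</p><p></p><p>text2</p></div>")
print([p for p in parser.paragraphs if p])  # ['text1', 'text2']

The BeautifulSoup answer above gets the same result with less code, but this version avoids the third-party dependency.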

Turn