
I'm trying to extract paragraphs from HTML using the following line of code:

paragraphs = re.match(r'<p>.{1,}</p>', html)

but it returns None even though I know the HTML contains paragraphs. Why?

Curious
    Argh!!! [Stop using regular expressions to parse HTML](http://stackoverflow.com/a/1732454/364696). We have proper HTML and XML parsers for this. For HTML, use [`HTMLParser`](https://docs.python.org/2/library/htmlparser.html) (Py2) or [`html.parser`](https://docs.python.org/3/library/html.parser.html) (Py3) or a third party package like BeautifulSoup. If it's true XHTML, you can use a more strict parser that can parse iteratively, e.g. [`xml.etree.ElementTree.iterparse`](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse). DON'T USE REGEXES! – ShadowRanger Dec 29 '15 at 01:46
    ...Horrible use of regex aside, isn't `.{1,}` just "any character one or more times"? Why on Earth wouldn't you use `.+` for that? – jpmc26 Dec 29 '15 at 01:52

3 Answers


Why not use an HTML parser to, well, parse HTML? Example using BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
...     <div>
...         <p>text1</p>
...         <p></p>
...         <p>text2</p>
...     </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']

Note that `text=True` helps to filter out empty paragraphs.

alecxe
  • Note that extracting paragraphs with BeautifulSoup can be about two orders of magnitude slower than a regexp-based solution. This is something to keep in mind if performance is important for your application. – jamix Apr 13 '21 at 15:27

Make sure you use re.search (or re.findall) instead of re.match, which only tries to match at the beginning of the string (and your html almost certainly does not start with a <p> tag).

You should also note that your pattern is currently greedy, meaning it will match everything between the first <p> tag and the last </p>, which is something you definitely do not want. Try

re.findall(r'<p(\s.*?)?>(.*?)</p>', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)

instead. The question mark makes the `.*?` quantifier non-greedy, so each match stops at the first closing </p> tag, and findall returns every match rather than just the first.
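For example (a quick interpreter sketch; I'm assuming `html` holds the markup from the question, and because the pattern has two capturing groups, findall returns a tuple per match with the paragraph text in the second element):

>>> import re
>>> html = "<div><p>text1</p><p></p><p>text2</p></div>"
>>> re.findall(r'<p(\s.*?)?>(.*?)</p>', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
[('', 'text1'), ('', ''), ('', 'text2')]

Empty paragraphs still show up here, so filter them out afterwards if you don't want them.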

Martin Konecny

You should be using re.search instead of re.match. The former will search the entire string whereas the latter will only match if the pattern is at the beginning of the string.
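A quick illustration with the pattern from the question (a minimal sketch; the sample markup here is made up):

>>> import re
>>> html = "<html><body><p>hello</p></body></html>"
>>> print(re.match(r'<p>.{1,}</p>', html))  # anchored at the start of the string
None
>>> re.search(r'<p>.{1,}</p>', html).group()  # scans the whole string
'<p>hello</p>'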

That said, regular expressions are a horrible tool for parsing HTML. You will hit a wall with them very shortly. I strongly recommend you look at HTMLParser or BeautifulSoup for your task.
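If you want to stay in the standard library, here is a minimal sketch using Python 3's html.parser (the module is called HTMLParser on Python 2); the ParagraphExtractor name is just my own for the example:

from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    # Collects the text content of every <p> element it sees.
    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            # Accumulate text so "<p>a<b>c</b></p>" still yields "ac".
            self.paragraphs[-1] += data

parser = ParagraphExtractor()
parser.feed("<div><p>text1</p><p></p><p>text2</p></div>")
print([p for p in parser.paragraphs if p])  # ['text1', 'text2']

The BeautifulSoup answer above gets the same result with less code, but this version avoids the third-party dependency.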

Turn