I'm trying to extract paragraphs from HTML by using the following line of code:
paragraphs = re.match(r'<p>.{1,}</p>', html)
but it returns None even though I know the HTML contains paragraphs. Why?
Why not use an HTML parser to, well, parse HTML? Example using BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <div>
... <p>text1</p>
... <p></p>
... <p>text2</p>
... </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']
Note that text=True helps to filter out empty paragraphs.
Make sure you use re.search (or re.findall) instead of re.match, which only matches at the very beginning of the html string (your html almost certainly does not begin with a <p> tag).
Also note that your pattern is greedy: if several paragraphs sit on one line, it will return everything between the first <p> tag and the last </p>, which is almost certainly not what you want. Try
re.findall(r'<p(\s.*?)?>(.*?)</p>', response.text, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
instead. The question mark after .* makes the match lazy, so it stops at the first closing </p> tag; re.DOTALL lets . match newlines, so paragraphs that span several lines are found too; and findall returns every match, whereas search returns only the first. Because the pattern has two groups, each result is a tuple of (attributes, paragraph text).
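To make the greedy versus lazy difference concrete, here is a minimal sketch using only the standard re module. The sample strings are made up for illustration and are not taken from the question.

```python
import re

# Two paragraphs on a single line (hypothetical sample input).
one_line = "<p>one</p><p>two</p>"

# Greedy: .+ runs to the LAST </p> on the line, so both paragraphs are swallowed.
greedy = re.findall(r"<p>.+</p>", one_line)

# Lazy: .+? stops at the FIRST </p>, giving one match per paragraph.
lazy = re.findall(r"<p>.+?</p>", one_line)

# A paragraph that spans two lines (hypothetical sample input).
multi_line = "<p>first\nsecond</p>"

# Without re.DOTALL, . does not match the newline, so nothing is found.
without_dotall = re.findall(r"<p>(.+?)</p>", multi_line)

# With re.DOTALL, . matches newlines as well.
with_dotall = re.findall(r"<p>(.+?)</p>", multi_line, flags=re.DOTALL)

print(greedy)          # ['<p>one</p><p>two</p>']
print(lazy)            # ['<p>one</p>', '<p>two</p>']
print(without_dotall)  # []
print(with_dotall)     # ['first\nsecond']
```

re.MULTILINE only changes how ^ and $ behave, so it has no effect on a pattern that uses neither anchor.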
You should be using re.search instead of re.match. The former scans the entire string, whereas the latter only matches if the pattern is at the beginning of the string.
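A small sketch of the difference, assuming an html string that starts with a <div> rather than a <p> (the sample input is invented for illustration):

```python
import re

# Hypothetical input: the string starts with <div>, not <p>.
html = "<div>\n<p>text1</p>\n<p>text2</p>\n</div>"
pattern = r"<p>.{1,}</p>"

# re.match only tries position 0, where the string has "<div>", so it fails.
match_result = re.match(pattern, html)

# re.search scans forward and returns the first place the pattern fits.
search_result = re.search(pattern, html)

# re.findall returns every non-overlapping match as a list of strings.
findall_result = re.findall(pattern, html)

print(match_result)            # None
print(search_result.group(0))  # <p>text1</p>
print(findall_result)          # ['<p>text1</p>', '<p>text2</p>']
```

Each paragraph here sits on its own line, and . does not match a newline by default, which is why the greedy pattern still returns separate paragraphs in this particular input.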
That said, regular expressions are a horrible tool for parsing HTML. You will hit a wall with them very shortly. I strongly recommend you look at HTMLParser or BeautifulSoup for your task.
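If you would rather not install anything, here is a rough sketch with the standard library parser (the HTMLParser module in Python 2, html.parser in Python 3). The class name ParagraphCollector and the helper extract_paragraphs are made up for this example, and the code assumes <p> tags are not nested.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects the text of each non-empty <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph texts
        self._in_p = False     # True while inside a <p> element
        self._chunks = []      # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are ignored here.
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._chunks).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._in_p = False


def extract_paragraphs(html):
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return collector.paragraphs


data = """
<div>
<p>text1</p>
<p></p>
<p class="note">text2</p>
</div>
"""
print(extract_paragraphs(data))  # ['text1', 'text2']
```

Unlike the plain <p> regex, this handles attributes on the tag and paragraphs that span several lines without any special flags.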