Looking for a regex to parse a string of malformed HTML

Question

I am looking for a regex pattern that will take me from this messy HTML, which was parsed with the wrong tags:

<dt>Released
<dd>2019-02-13 <dt>First review
<dd>2019-02-13
<dt>Age
<dd>
914 days (2.5 years)
</dd></dt></dd></dt></dd></dt>

To this:

2019-02-13
2019-02-13
914 days (2.5 years)

It seems that the easiest approach is to convert the original object to a string and use a regex to extract the correct values. Which regex should I use?

What code have you already tried? – BLimitless, Aug 14 '21 at 19:50

ggorlen · Accepted Answer · 2021-08-14T20:09:17.227

Don't use regex to parse HTML. Use an HTML parser:

>>> html = '''<dt>Released
... <dd>2019-02-13 <dt>First review
... <dd>2019-02-13
... <dt>Age
... <dd>
... 914 days (2.5 years)
... </dd></dt></dd></dt></dd></dt>'''
>>> from bs4 import BeautifulSoup
>>> [x.text for x in BeautifulSoup(html, "lxml").find_all("dd")]
['2019-02-13 ', '2019-02-13\n', '\n914 days (2.5 years)\n']

(use x.text.strip() if you don't want whitespace)

If it's not clear what's going on here, the HTML parser (lxml) actually fixes the HTML for you (amazing!!):

>>> BeautifulSoup(html, "lxml")
<html><body><dt>Released
</dt><dd>2019-02-13 </dd><dt>First review
</dt><dd>2019-02-13
</dd><dt>Age
</dt><dd>
914 days (2.5 years)
</dd></body></html>

Not the case with the builtin html.parser:

>>> BeautifulSoup(html, "html.parser")
<dt>Released
<dd>2019-02-13 <dt>First review
<dd>2019-02-13
<dt>Age
<dd>
914 days (2.5 years)
</dd></dt></dd></dt></dd></dt>
>>> [x.text.strip() for x in BeautifulSoup(html, "html.parser").find_all("dd")]
['2019-02-13 First review\n2019-02-13\nAge\n\n914 days (2.5 years)', '2019-02-13\nAge\n\n914 days (2.5 years)', '914 days (2.5 years)']
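
If installing lxml is not an option, the nesting can also be repaired by hand. The following is only a minimal sketch using the standard library's html.parser.HTMLParser directly (not BeautifulSoup), under the assumption that every new <dt> or <dd> start tag implicitly closes the previous one, as in the input above. The class name DefinitionListParser and its values attribute are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collect the text of each <dd>, assuming a new <dt>/<dd> closes the previous one."""

    def __init__(self):
        super().__init__()
        self.values = []      # stripped text of every finished <dd>
        self._current = None  # text chunks of the <dd> being read, or None

    def _flush(self):
        # Store the <dd> currently being read, if there is one.
        if self._current is not None:
            self.values.append("".join(self._current).strip())
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._flush()           # implicit close of the previous element
            if tag == "dd":
                self._current = []  # start collecting text for this <dd>

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._flush()

    def handle_data(self, data):
        # Text outside a <dd> (for example the <dt> labels) is ignored.
        if self._current is not None:
            self._current.append(data)

    def close(self):
        super().close()
        self._flush()


html = '''<dt>Released
<dd>2019-02-13 <dt>First review
<dd>2019-02-13
<dt>Age
<dd>
914 days (2.5 years)
</dd></dt></dd></dt></dd></dt>'''

parser = DefinitionListParser()
parser.feed(html)
parser.close()
print(parser.values)
```

This only covers flat <dt>/<dd> sequences like the one in the question; for anything more involved, a parser that repairs the tree (such as lxml above) is the safer choice.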

score -1 · Answer 2 · answered Aug 14 '21 at 20:27

(\d\d\d\d-\d\d-\d\d)\D*(\d\d\d\d-\d\d-\d\d)\D*(\d* days \(.*\))

This may help you. It captures:

$1 and $2: the first and second dates, in dddd-dd-dd format (d for digit).
$3: the string in "any number, then days, then anything in parentheses" format.
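
As a rough sketch of how this pattern could be applied in Python, using the standard re module on the string from the question (the variable names released, first_review and age are assumptions made for this illustration):

```python
import re

html = '''<dt>Released
<dd>2019-02-13 <dt>First review
<dd>2019-02-13
<dt>Age
<dd>
914 days (2.5 years)
</dd></dt></dd></dt></dd></dt>'''

# \D* skips any run of non-digit characters (tags, labels, newlines)
# between the captured values.
pattern = re.compile(
    r"(\d\d\d\d-\d\d-\d\d)\D*(\d\d\d\d-\d\d-\d\d)\D*(\d* days \(.*\))"
)

match = pattern.search(html)
if match:
    released, first_review, age = match.groups()
    print(released)
    print(first_review)
    print(age)
else:
    print("pattern did not match")
```

Note that the pattern depends on this exact layout: a digit inside any of the labels between the values, or a different number of fields, would make it fail or capture the wrong text.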

You can check it on the Regex101 website.
