-1

I am looking for a regex pattern that will get me from this messy HTML, which was parsed with mismatched tags:

<dt>Released
<dd>2019-02-13 <dt>First review
<dd>2019-02-13
<dt>Age
<dd>
914 days (2.5 years)
</dd></dt></dd></dt></dd></dt>

To this:

2019-02-13
2019-02-13
914 days (2.5 years)

It seems that the easiest approach is to convert the original object to a string and use a regex to extract the correct values. What regex should I use?

ggorlen

2 Answers

5

Don't use regex to parse HTML. Use an HTML parser:

>>> html = '''<dt>Released
... <dd>2019-02-13 <dt>First review
... <dd>2019-02-13
... <dt>Age
... <dd>
... 914 days (2.5 years)
... </dd></dt></dd></dt></dd></dt>'''
>>> from bs4 import BeautifulSoup
>>> [x.text for x in BeautifulSoup(html, "lxml").find_all("dd")]
['2019-02-13 ', '2019-02-13\n', '\n914 days (2.5 years)\n']

(use x.text.strip() if you don't want whitespace)

If it's not clear what's going on here, the HTML parser (lxml) actually fixes the HTML for you (amazing!!):

>>> BeautifulSoup(html, "lxml")
<html><body><dt>Released
</dt><dd>2019-02-13 </dd><dt>First review
</dt><dd>2019-02-13
</dd><dt>Age
</dt><dd>
914 days (2.5 years)
</dd></body></html>

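This is not the case with the built-in html.parser, which leaves the nesting as written, so each <dd> swallows the text of everything after it: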

>>> BeautifulSoup(html, "html.parser")
<dt>Released
<dd>2019-02-13 <dt>First review
<dd>2019-02-13
<dt>Age
<dd>
914 days (2.5 years)
</dd></dt></dd></dt></dd></dt>
>>> [x.text.strip() for x in BeautifulSoup(html, "html.parser").find_all("dd")]
['2019-02-13 First review\n2019-02-13\nAge\n\n914 days (2.5 years)', '2019-02-13\nAge\n\n914 days (2.5 years)', '914 days (2.5 years)']
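
If lxml is not installed, the same "a new <dt> or <dd> implicitly closes the previous one" behavior can be approximated with the standard library alone. The following is only a minimal sketch, not what lxml does internally: the class name DDTextCollector and its attributes are made up for illustration, and it handles only flat <dt>/<dd> lists like the one in the question.

```python
from html.parser import HTMLParser


class DDTextCollector(HTMLParser):
    """Hypothetical helper: collect the text of each <dd>, treating any
    new <dt> or <dd> start tag as the end of the one currently open."""

    def __init__(self):
        super().__init__()
        self.values = []    # stripped text of every finished <dd>
        self._parts = None  # text chunks of the open <dd>, or None

    def _flush(self):
        # Close the open <dd>, if any, and store its stripped text.
        if self._parts is not None:
            self.values.append("".join(self._parts).strip())
            self._parts = None

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._flush()  # implied end tag for the previous item
            if tag == "dd":
                self._parts = []

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._flush()

    def handle_data(self, data):
        # Only keep text that appears while a <dd> is open.
        if self._parts is not None:
            self._parts.append(data)


html = """<dt>Released
<dd>2019-02-13 <dt>First review
<dd>2019-02-13
<dt>Age
<dd>
914 days (2.5 years)
</dd></dt></dd></dt></dd></dt>"""

collector = DDTextCollector()
collector.feed(html)
collector.close()
print(collector.values)
```

For anything beyond this simple shape, a repairing parser such as lxml is still the safer choice.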
ggorlen
-1
(\d\d\d\d-\d\d-\d\d)\D*(\d\d\d\d-\d\d-\d\d)\D*(\d* days \(.*\))

This may help you. It captures:

  • $1 and $2: the first and second dates, in the dddd-dd-dd format (d for digit).

  • $3: the string in the "<any number> days (<anything in the parentheses>)" format.

You can check it on the Regex101 website.
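
As a rough illustration, the pattern can be applied in Python with the standard re module. The variable names below are placeholders, and the sketch assumes the input always contains exactly two dates followed by the "days" line, as in the question:

```python
import re

html = """<dt>Released
<dd>2019-02-13 <dt>First review
<dd>2019-02-13
<dt>Age
<dd>
914 days (2.5 years)
</dd></dt></dd></dt></dd></dt>"""

# \D* skips over the tags and newlines between the values; "." does not
# match a newline by default, so the last group stays on a single line.
pattern = re.compile(
    r"(\d\d\d\d-\d\d-\d\d)\D*(\d\d\d\d-\d\d-\d\d)\D*(\d* days \(.*\))"
)

match = pattern.search(html)
values = list(match.groups()) if match else []
print(values)
```

If the markup gains a digit between the two dates (for example in an attribute), \D* will stop matching, which is one reason this approach is more fragile than a parser.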

Shahriar