Need Assistance with a regex pattern in Python – Parsing complex HTML structures

Question

I'm trying to parse complex HTML structures using Python's re module, and I've run into a roadblock with my regex pattern. Here's what I'm trying to do:

I have HTML text that contains nested elements, and I want to extract the content of the innermost tags. However, I can't seem to get my regex pattern right. Here's the code I'm using:

import re

html_text = """
<div>
    <div>
        <div>
            Innermost Content 1
        </div>
    </div>
    <div>
        Innermost Content 2
    </div>
</div>
"""

pattern = r'<div>(.*?)<\/div>'
result = re.findall(pattern, html_text, re.DOTALL)

print(result)

I expected this code to return the content of the innermost elements, like this:

['Innermost Content 1', 'Innermost Content 2']

But it's not working as expected. What am I doing wrong with my regex pattern, and how can I fix it to achieve the desired result? Any help would be greatly appreciated!

As soon as the HTML gets more complicated (extra whitespace, attributes, comments, JavaScript code...), [this can become very tricky](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). I'd recommend using a proper HTML parsing library. - The_spider, Sep 02 '23 at 13:25

score 1 · Answer 1 · answered Sep 02 '23 at 13:16

Try this modified code. The pattern now uses [^<]*? instead of .*?, so a match can no longer span a nested tag, and an extra line strips the surrounding whitespace and newlines:

import re

html_text = """
<div>
    <div>
        <div>
            Innermost Content 1
        </div>
    </div>
    <div>
        Innermost Content 2
    </div>
</div>
"""

pattern = r'<div>([^<]*?)<\/div>'
result = re.findall(pattern, html_text, re.DOTALL)

result = [content.strip() for content in result if content.strip()]

print(result)
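To see why the change helps, it may be useful to run both patterns side by side. This is only a minimal sketch on the same sample HTML; the variable names lazy_dot and no_tags are made up for illustration:

```python
import re

html_text = """
<div>
    <div>
        <div>
            Innermost Content 1
        </div>
    </div>
    <div>
        Innermost Content 2
    </div>
</div>
"""

# Original pattern: with re.DOTALL, '.' also matches '<' and newlines, so a
# match starts at the first available <div> and runs to the nearest </div>,
# swallowing any nested opening tags on the way.
lazy_dot = re.findall(r'<div>(.*?)</div>', html_text, re.DOTALL)

# Modified pattern: [^<] cannot cross a tag, so only a <div> whose content
# contains no further tags can match. No re.DOTALL is needed here, because a
# negated character class already matches newlines.
no_tags = re.findall(r'<div>([^<]*)</div>', html_text)

print([s.strip() for s in lazy_dot])
print([s.strip() for s in no_tags])
```

The first list still contains leftover `<div>` tags in its first item, while the second contains only the two innermost strings. Note that this only holds while the tags are written exactly as `<div>`, with no attributes or other markup inside them.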

score 0 · Answer 2 · answered Sep 02 '23 at 13:15

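You can post-process the matches from your original pattern, removing any leftover tags and then trimming the surrounding whitespace:

[re.sub(r'</?div>', '', item).strip() for item in result]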

Alternatively, you can use a proper HTML parsing library such as BeautifulSoup:

from bs4 import BeautifulSoup

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_text, 'html.parser')

# Keep only the <div> elements that contain no nested <div>, and print their text
for div in soup.find_all('div'):
    if div.find('div') is None:
        print(div.get_text(strip=True))
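If installing a third-party library is not an option, the same idea can probably be sketched with the standard library's html.parser module. The class name InnermostDivText and its attributes are hypothetical, chosen for this example only, and the sketch assumes reasonably well-formed markup:

```python
from html.parser import HTMLParser


class InnermostDivText(HTMLParser):
    """Collect the text of every <div> that contains no nested <div>."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one [text_chunks, has_child_div] entry per open <div>
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            if self.stack:
                self.stack[-1][1] = True   # the enclosing <div> is not innermost
            self.stack.append([[], False])

    def handle_data(self, data):
        # Text belongs to the most recently opened <div>, if there is one
        if self.stack:
            self.stack[-1][0].append(data)

    def handle_endtag(self, tag):
        if tag == 'div' and self.stack:
            chunks, has_child_div = self.stack.pop()
            text = ''.join(chunks).strip()
            if not has_child_div and text:
                self.results.append(text)


html_text = """
<div>
    <div>
        <div>
            Innermost Content 1
        </div>
    </div>
    <div>
        Innermost Content 2
    </div>
</div>
"""

parser = InnermostDivText()
parser.feed(html_text)
parser.close()
print(parser.results)
```

On the sample input this should print the two innermost strings. Because the parser reports tag names separately from their attributes, the same code also handles tags such as `<div class="x">`, which the plain `<div>` regex patterns would miss.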

score 0 · Answer 3 · answered Sep 02 '23 at 13:55

You can also use re.split() to split the text on the tags and newlines, then drop the empty pieces:

print([st.strip() for st in re.split(r'<div>\n?|</div>\n?|\n', html_text) if st.strip()])

Output:

['Innermost Content 1', 'Innermost Content 2']

LetzerWille
