1

I'm trying to parse complex HTML structures using Python's re module, and I've run into a roadblock with my regex pattern. Here's what I'm trying to do:

I have HTML text that contains nested elements, and I want to extract the content of the innermost tags. However, I can't seem to get my regex pattern right. Here's the code I'm using:

import re

html_text = """
<div>
    <div>
        <div>
            Innermost Content 1
        </div>
    </div>
    <div>
        Innermost Content 2
    </div>
</div>
"""

pattern = r'<div>(.*?)<\/div>'
result = re.findall(pattern, html_text, re.DOTALL)

print(result)

I expected this code to return the content of the innermost elements, like this:

['Innermost Content 1', 'Innermost Content 2']

But that's not what I get: the first match still contains the nested <div> tags, and both matches keep the surrounding whitespace. What am I doing wrong with my regex pattern, and how can I fix it to achieve the desired result? Any help would be greatly appreciated!

prabu naresh
  • From the moment the HTML starts getting more complicated (unnecessary spaces, attributes, comments, javascript code...), [this can become very tricky](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). I'd recommend using a proper HTML parsing library. – The_spider Sep 02 '23 at 13:25
  • Why don't you just use the built-in HTMLParser? – OneMadGypsy Sep 02 '23 at 14:36

3 Answers

1

Try this modified code. The pattern uses [^<]*? in place of .*?, so a match can never span another tag, and an extra line strips the surrounding whitespace and newlines.

import re

html_text = """
<div>
    <div>
        <div>
            Innermost Content 1
        </div>
    </div>
    <div>
        Innermost Content 2
    </div>
</div>
"""

pattern = r'<div>([^<]*?)<\/div>'
result = re.findall(pattern, html_text, re.DOTALL)

result = [content.strip() for content in result if content.strip()]

print(result)
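
To see why the original pattern falls short, it may help to compare what each pattern captures. The following is only a sketch that reuses the html_text from the question; the names lazy_matches and leaf_matches are illustrative and not part of the original code. The lazy .*? begins at the first <div> it finds (the outermost one) and stops at the first </div>, so its first capture still contains the nested opening tags, whereas [^<]* cannot cross a tag at all.

```python
import re

html_text = """
<div>
    <div>
        <div>
            Innermost Content 1
        </div>
    </div>
    <div>
        Innermost Content 2
    </div>
</div>
"""

# Original pattern: the first match starts at the outermost <div>
# and runs to the first </div>, so it swallows the nested opening tags.
lazy_matches = re.findall(r'<div>(.*?)</div>', html_text, re.DOTALL)
print('<div>' in lazy_matches[0])  # True

# [^<]* cannot match '<', so only tag-free content is captured.
leaf_matches = [m.strip() for m in re.findall(r'<div>([^<]*)</div>', html_text)]
print(leaf_matches)  # ['Innermost Content 1', 'Innermost Content 2']
```

Note that this assumes bare <div> tags without attributes and content that contains no '<' character; anything more complicated is probably better handled by an HTML parser.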
smoks
0

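You can clean up the matches from your original pattern by removing the leftover tags and stripping the surrounding whitespace:

[re.sub(r'</?div>', '', item).strip() for item in result]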

You can also use a proper HTML parsing library like BeautifulSoup instead:

from bs4 import BeautifulSoup

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_text, 'html.parser')

# Keep only the <div> elements that contain no nested <div>
for div in soup.find_all('div'):
    if div.find('div') is None:
        print(div.get_text(strip=True))
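
If you would rather avoid a third-party dependency, the built-in HTMLParser mentioned in the comments can do the same job. The following is a minimal sketch, not a definitive implementation: the class name InnermostDivText and its stack and results attributes are made up for illustration, and it assumes that "innermost" means a <div> with no <div> descendants.

```python
from html.parser import HTMLParser


class InnermostDivText(HTMLParser):
    """Collect the text of <div> elements that contain no nested <div>."""

    def __init__(self):
        super().__init__()
        # One entry per currently open <div>: [text_chunks, has_child_div]
        self.stack = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.stack:
                # The enclosing <div> has a child, so it is not innermost.
                self.stack[-1][1] = True
            self.stack.append([[], False])

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.stack:
            self.stack[-1][0].append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            chunks, has_child_div = self.stack.pop()
            if not has_child_div:
                self.results.append("".join(chunks).strip())


html_text = """
<div>
    <div>
        <div>
            Innermost Content 1
        </div>
    </div>
    <div>
        Innermost Content 2
    </div>
</div>
"""

parser = InnermostDivText()
parser.feed(html_text)
parser.close()
print(parser.results)  # ['Innermost Content 1', 'Innermost Content 2']
```

Unlike the regex approaches, this keeps working when the <div> tags carry attributes, because the parser reports the tag name separately from its attributes.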
0

You can also use re.split() to split on the tags and newlines, then keep only the non-blank pieces:

print([st.strip() for st in re.split(r'<div>\n?|</div>\n?|\n', html_text) if st and not st.isspace()])

['Innermost Content 1', 'Innermost Content 2']
LetzerWille