I'm trying to extract the text `I wish to get all this` from the following HTML elements using regex, but my current attempt returns `<DIV>I wish</DIV> to get all this` instead.

This is how I tried:

import re

itemtxt = """
<TABLE>
    <TR>
        <TD><DIV>I wish</DIV></TD>
        <TD>to</TD>
        <TD>get all this</TD>
    </TR>
</TABLE>
"""
matches = re.findall(r">(.*)<", itemtxt)
print(' '.join(matches))

How can I extract `I wish to get all this` from the above HTML elements using regex?

Konrad Rudolph
SMTH

1 Answer

Use the Beautiful Soup library for this. Do not use regex. Well OK, you actually can use regex here, but Soup is a better way to go.

import re

itemtxt = """
<TABLE>
    <TR>
        <TD><DIV>I wish</DIV></TD>
        <TD>to</TD>
        <TD>get all this</TD>
    </TR>
</TABLE>
"""
matches = re.findall(r'<[^>]+>((?!<[^>]+>).*?)</[^>]+>', itemtxt)
print(' '.join(matches))

This prints:

I wish to get all this

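The regex pattern uses a negative lookahead placed right after the opening tag, so an opening tag that is immediately followed by another tag (such as <TD> before <DIV>) is skipped and only the innermost content is captured. Strictly speaking this is not a full tempered dot, which would repeat the lookahead before every character, as in `(?:(?!<[^>]+>).)*?`; for the input above the single check is enough. Here is a brief explanation:

<[^>]+>           match an opening HTML tag
((?!<[^>]+>).*?)  capture the content, provided it does not
                  start with another HTML tag, lazily up to
</[^>]+>          the next closing HTML tag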

Then we join the matched strings together with a space.
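For the parser-based route recommended at the top of this answer, here is a minimal sketch. Beautiful Soup is a third-party package, so this sketch swaps in the standard library's `html.parser` instead to stay dependency-free; the class name `TextCollector` is made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects the non-blank text nodes of a fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # handle_data receives the text found between tags;
        # whitespace-only runs (indentation, newlines) are skipped.
        text = data.strip()
        if text:
            self.chunks.append(text)


itemtxt = """
<TABLE>
    <TR>
        <TD><DIV>I wish</DIV></TD>
        <TD>to</TD>
        <TD>get all this</TD>
    </TR>
</TABLE>
"""

collector = TextCollector()
collector.feed(itemtxt)
collector.close()

result = ' '.join(collector.chunks)
print(result)  # I wish to get all this
```

With Beautiful Soup installed, something along the lines of `BeautifulSoup(itemtxt, 'html.parser').get_text(' ', strip=True)` should do the same job in one line. Either way the nesting depth of the tags no longer matters, which is the main reason to prefer a parser over a regex here.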

Tim Biegeleisen
  • Thanks for your solution Tim. It does work. Can I not go for `re.findall(r">(.*?)<", itemtxt)` as well? – SMTH Dec 31 '20 at 13:08
  • Well that would also pick up on some empty string matches, and it would also match content not at the deepest level. If you can accept that, then maybe that pattern would also work. – Tim Biegeleisen Dec 31 '20 at 13:11
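To make the difference discussed in these comments concrete, here is a small sketch that runs the three patterns from this page against the same input. The variable names (`greedy`, `lazy`, `innermost`, `lazy_clean`) are only for illustration.

```python
import re

itemtxt = """
<TABLE>
    <TR>
        <TD><DIV>I wish</DIV></TD>
        <TD>to</TD>
        <TD>get all this</TD>
    </TR>
</TABLE>
"""

# Pattern from the question: greedy, so it runs to the last "<" on each line.
greedy = re.findall(r">(.*)<", itemtxt)

# Pattern from the comment: lazy, so it also matches the empty gap
# between adjacent tags such as "<TD><DIV>" and "</DIV></TD>".
lazy = re.findall(r">(.*?)<", itemtxt)

# Pattern from the answer: skips an opening tag directly followed by a tag.
innermost = re.findall(r'<[^>]+>((?!<[^>]+>).*?)</[^>]+>', itemtxt)

print(greedy)     # ['<DIV>I wish</DIV>', 'to', 'get all this']
print(lazy)       # ['', 'I wish', '', 'to', 'get all this']
print(innermost)  # ['I wish', 'to', 'get all this']

# If the lazy pattern is otherwise acceptable, the empty matches
# can be filtered out before joining.
lazy_clean = [m for m in lazy if m]
print(' '.join(lazy_clean))  # I wish to get all this
```

For this particular input the filtered lazy result equals the answer's result; with text sitting directly inside an outer tag next to a nested element, the lazy pattern would also return that outer text.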