Can't scoop out some sporadic texts

Question

I want to extract the text `I wish to get all this` from the following HTML using regex, but my current attempt returns `<DIV>I wish</DIV> to get all this` instead.

This is how I tried:

import re

itemtxt = """
<TABLE>
    <TR>
        <TD><DIV>I wish</DIV></TD>
        <TD>to</TD>
        <TD>get all this</TD>
    </TR>
</TABLE>
"""
matches = re.findall(r">(.*)<", itemtxt)
print(' '.join(matches))

How can I extract `I wish to get all this` from the above HTML using regex?

https://stackoverflow.com/a/1732454/548562 - there are libraries out there for parsing HTML, Beautiful soup being a good one — Iain Shelvington, Dec 31 '20 at 12:56
Use the Beautiful Soup library. Do _not_ use regex for this. — Tim Biegeleisen, Dec 31 '20 at 12:57
Does this answer your question? [Python, remove all html tags from string](https://stackoverflow.com/questions/37018475/python-remove-all-html-tags-from-string) — Ryszard Czech, Dec 31 '20 at 20:54

score 0 · Accepted Answer · answered Dec 31 '20 at 13:00

Use the Beautiful Soup library for this. Do not use regex. Well OK, you actually can use regex here, but Soup is a better way to go.
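As a rough illustration of the parser-based route, here is a minimal sketch. It uses the standard library's `html.parser` instead of Beautiful Soup, purely so that it runs without installing anything. The `TextCollector` class and `extract_text` helper are illustrative names, not part of either library.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects the non-blank text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # handle_data also receives the whitespace between tags, so skip blanks.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return all text nodes of `html`, joined by single spaces."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


itemtxt = """
<TABLE>
    <TR>
        <TD><DIV>I wish</DIV></TD>
        <TD>to</TD>
        <TD>get all this</TD>
    </TR>
</TABLE>
"""
print(extract_text(itemtxt))  # I wish to get all this
```

A parser does not care how deeply the tags are nested or whether the text spans several lines. The regex route mentioned above looks like this: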

import re

itemtxt = """
<TABLE>
    <TR>
        <TD><DIV>I wish</DIV></TD>
        <TD>to</TD>
        <TD>get all this</TD>
    </TR>
</TABLE>
"""
matches = re.findall(r'<[^>]+>((?!<[^>]+>).*?)</[^>]+>', itemtxt)
print(' '.join(matches))

This prints:

I wish to get all this

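The regex pattern uses a negative lookahead so that, in the case of nested HTML tags, only the innermost content is matched. The lookahead is checked once, at the start of the content, so strictly speaking this is not a tempered dot (that would be `(?:(?!<[^>]+>).)*?`, which tests every character). It is enough for this input because each innermost element holds plain text on a single line. Here is a brief explanation:

<[^>]+>           match an HTML tag
((?!<[^>]+>).*?)  capture the content lazily, provided it does
                  not begin with another HTML tag, up to the
</[^>]+>          nearest closing HTML tag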

Then, we join the captured strings together with a single space.
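To see how much the lookahead actually guards against, the sketch below compares the pattern from this answer with a variant that repeats the lookahead before every character. The variant, the names `single_check` and `tempered`, and the `nested` sample are assumptions added for illustration, not part of the original answer.

```python
import re

TAG = r"<[^>]+>"

# Pattern from the answer: the lookahead runs once, before the content.
single_check = re.compile(rf"{TAG}((?!{TAG}).*?)</[^>]+>")

# Hypothetical variant: the lookahead runs before every character
# (a tempered dot), and + requires at least one character of content.
tempered = re.compile(rf"{TAG}((?:(?!{TAG}).)+)</[^>]+>")

itemtxt = """
<TABLE>
    <TR>
        <TD><DIV>I wish</DIV></TD>
        <TD>to</TD>
        <TD>get all this</TD>
    </TR>
</TABLE>
"""
# Sample with text followed by a nested tag, to show where the two differ.
nested = "<TD>a<B>b</B></TD>"

print(single_check.findall(itemtxt))  # ['I wish', 'to', 'get all this']
print(tempered.findall(itemtxt))      # ['I wish', 'to', 'get all this']
print(single_check.findall(nested))   # ['a<B>b']
print(tempered.findall(nested))       # ['b']
```

Both patterns agree on the table from the question. On the mixed sample, the first one lets a tag slip into the capture and the second one drops the leading `a`, which is the kind of input where a real HTML parser is the safer tool.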

Thanks for your solution Tim. It does work. Can I not go for `re.findall(r">(.*?)<", itemtxt)` as well? — SMTH, Dec 31 '20 at 13:08
Well that would also pick up on some empty string matches, and it would also match content not at the deepest level. If you can accept that, then maybe that pattern would also work. — Tim Biegeleisen, Dec 31 '20 at 13:11
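For reference, a small sketch of what the simpler pattern from the comment above returns on the same input, and one possible way to drop the empty matches. The filtering step and the names `lazy` and `cleaned` are assumptions for illustration and were not suggested in the thread.

```python
import re

itemtxt = """
<TABLE>
    <TR>
        <TD><DIV>I wish</DIV></TD>
        <TD>to</TD>
        <TD>get all this</TD>
    </TR>
</TABLE>
"""

# Lazy version of the pattern from the question.
lazy = re.findall(r">(.*?)<", itemtxt)
print(lazy)  # ['', 'I wish', '', 'to', 'get all this']

# Adjacent tags such as <TD><DIV> produce empty matches; filter them out.
cleaned = " ".join(m for m in lazy if m.strip())
print(cleaned)  # I wish to get all this
```

Joining the unfiltered list would leave stray spaces in the result, which is the cost of the empty matches mentioned in the comment.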
