Regular Expression involving AND with Python

Question

I've been struggling for a while trying to get the right regular expression for the following task:

I want to strip data from table tags in an HTML file using Python. To do this, my approach is to repeatedly do the following (with the HTML line between the tags stored as a string):

s = "<td ...>Desired Content</td>"

Re-assign s to the same string with everything of the form '<...>' removed.

s = re.sub('<{1}(not '<' and not '>').*>{1}', '', s)

Repeat until left with s = "Desired Content".

My question is how to express the part in parentheses. Thank you.

I tried

import re

test_str = '<td style="color:blue">Hello</td>'
test_str = re.sub('<{1}^[<>].*>{1}','',test_str)
print(test_str)

which you can see leaves my test string unchanged. What am I doing wrong?

I expect the code above to give me test_str = "Hello</td>", which I would then feed back into this method to strip "</td>", giving me "Hello".

`{1}` is a meaningless quantifier. – CAustin May 17 '23 at 23:43

score -1 · Answer 1 · answered May 17 '23 at 23:33


To negate a character class, ^ should be placed after [. In addition, you don't need to specify {1} for a character appearing once.

test_str = re.sub('<[^<>]*>', '', test_str)
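
To see the difference, here is a small sketch that runs the pattern from the question and the corrected pattern on the same test string. In the original pattern, `^` sits outside the brackets, so it acts as a start-of-string anchor, which cannot match right after a `<` has been consumed. The names `broken` and `fixed` are only for illustration.

```python
import re

test_str = '<td style="color:blue">Hello</td>'

# Pattern from the question: '^' is outside the brackets, so it is an anchor
# that cannot match after '<'. Nothing is replaced.
broken = re.sub('<{1}^[<>].*>{1}', '', test_str)

# Corrected pattern: '[^<>]*' matches any run of characters other than '<' and '>'.
fixed = re.sub('<[^<>]*>', '', test_str)

print(broken)
print(fixed)
```

Because `re.sub` replaces every non-overlapping match by default, a single call removes both the opening and the closing tag, so there is no need to feed the result back in.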

However, note that it is far more appropriate to use a dedicated HTML parser like BeautifulSoup instead of regular expressions to get data from HTML.
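
As a rough illustration of the parser-based approach: BeautifulSoup is a third-party package, so the sketch below swaps in the standard library's `html.parser` module to keep the example self-contained. The class name `CellTextExtractor` and the sample markup are hypothetical, and the sketch assumes table cells are not nested inside one another.

```python
from html.parser import HTMLParser


class CellTextExtractor(HTMLParser):
    """Collects the text found inside <td> elements (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # Start a new, empty cell whenever a <td> opens.
        if tag == 'td':
            self.in_cell = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still belongs to the open cell.
        if self.in_cell:
            self.cells[-1] += data


parser = CellTextExtractor()
parser.feed('<tr><td style="color:blue">Hello</td><td><b>World</b></td></tr>')
parser.close()
print(parser.cells)
```

Unlike the regular expression, this keeps the text of each cell separate and does not depend on how the tags or their attributes are written.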

answered May 17 '23 at 23:33

Unmitigated


Thank you so much! Wow, it looks like re.sub eliminates all occurrences of the pattern at once, which is what I was trying to do. – Seeingstars May 17 '23 at 23:36
