I've been struggling for a while trying to get the right regular expression for the following task:
I want to extract the data inside table tags in an HTML file using Python. To do this, my approach is to repeatedly do the following (with the HTML line between the tags stored as a string):
s = "<td ...>Desired Content</td>"
- Re-assign s to the same string with a '<...>' tag (the angle brackets and everything between them) removed.
s = re.sub('<{1}(not '<' and not '>').*>{1}', '', s)
- Repeat until left with s = "Desired Content".
My question is how to express the part in parentheses as a regular expression. Thank you.
I tried
import re
test_str = '<td style="color:blue">Hello</td>'
test_str = re.sub('<{1}^[<>].*>{1}','',test_str)
print(test_str)
which you can see leaves my test string unchanged. What am I doing wrong?
I expect the above code to give me test_str = "Hello</td>", which I would then feed back into the same substitution to remove "</td>", giving me "Hello".
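
For comparison, here is a minimal sketch of one pattern that seems to fit the description in parentheses, assuming the goal is to match '<', then any run of characters that are neither '<' nor '>', then '>'. The name TAG and the use of count=1 are illustrative choices, not part of the attempt above.

```python
import re

# '<', then any run of characters that are neither '<' nor '>', then '>'.
# The caret negates the class only when it is the first character inside the
# brackets. Outside them it anchors to the start of the string, which is why
# '<{1}^[<>].*>{1}' cannot match once '<' has already been consumed.
TAG = re.compile(r'<[^<>]*>')

test_str = '<td style="color:blue">Hello</td>'

# count=1 removes only the first tag, mirroring the one-step-at-a-time approach.
step_one = TAG.sub('', test_str, count=1)
print(step_one)  # Hello</td>

# Without count, sub() removes every match in a single pass, so no loop is needed.
print(TAG.sub('', test_str))  # Hello
```

The '{1}' quantifiers are dropped because they are redundant: a literal character already matches exactly once.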