I've been struggling for a while trying to get the right regular expression for the following task:

I want to strip the data out of table tags in an HTML file using Python. My approach is to repeatedly do the following (with the HTML line between the tags stored as a string):

s = "<td ...>Desired Content</td>"

  1. Reassign s to the same string with one '<...>' tag (the angle brackets and everything between them) removed.

s = re.sub('<{1}(not '<' and not '>').*>{1}', '', s)

  2. Repeat until left with s = "Desired Content".

My question is how to express the part in parentheses. Thank you.

I tried

import re

test_str = '<td style="color:blue">Hello</td>'
test_str = re.sub('<{1}^[<>].*>{1}','',test_str)
print(test_str)

which you can see leaves my test string unchanged. What am I doing wrong?

I expect the above code to give me test_str = "Hello</td>", which I would then feed back into the same method to strip "</td>", leaving me with "Hello".

1 Answer

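To negate a character class, ^ must be placed immediately after the opening [. Outside a character class, ^ is a start-of-string anchor, so a '<' followed by ^ can never match, which is why your string is left unchanged. In addition, you don't need to specify {1} for a character appearing once; that is already the default.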

test_str = re.sub('<[^<>]*>', '', test_str)
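As a quick check, here is a minimal sketch comparing the corrected pattern with the original attempt. The names TAG_RE, step_one, all_removed and unchanged are only illustrative, and count=1 is used to mimic the one-tag-at-a-time approach from the question; it is not required.

```python
import re

# '<', then any run of characters that are neither '<' nor '>', then '>'.
TAG_RE = re.compile(r'<[^<>]*>')

test_str = '<td style="color:blue">Hello</td>'

# count=1 removes only the first tag, as in the step-by-step approach.
step_one = TAG_RE.sub('', test_str, count=1)
print(step_one)

# Without count, every tag is removed in a single call, so no loop is needed.
all_removed = TAG_RE.sub('', test_str)
print(all_removed)

# The original pattern: ^ outside a class is an anchor, so nothing matches
# and the string comes back unchanged.
unchanged = re.sub('<{1}^[<>].*>{1}', '', test_str)
print(unchanged == test_str)
```

Since re.sub replaces every non-overlapping match by default, the repeat-until-done loop is probably unnecessary for a simple case like this one.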

However, note that it is far more appropriate to use a dedicated HTML parser such as BeautifulSoup instead of regular expressions to get data from HTML.
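To illustrate the parser-based route, here is a rough sketch that uses the standard library's html.parser module rather than BeautifulSoup, only so that it runs without installing anything. The class name CellTextParser and the sample table are hypothetical; a real page may need more handling (nested tables, whitespace, and so on).

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collects the text found inside <td> elements (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <td> elements are currently open
        self.cells = []   # one string per <td> encountered

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.depth += 1
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that appears while a <td> is open.
        if self.depth:
            self.cells[-1] += data


parser = CellTextParser()
parser.feed('<table><tr><td style="color:blue">Hello</td><td>World</td></tr></table>')
parser.close()
cells = parser.cells
print(cells)
```

With BeautifulSoup installed, the same idea is usually much shorter, but the principle is the same: let a parser find the cells instead of matching angle brackets by hand.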

Unmitigated