For transforming a html-formatted file to a plain text file with Python, I need to delete all tables if the text within the table contains more than 40% numeric characters.
Specifically, I would like to:
- identify each table element in a html file
- calculate the number of numeric and alphabetic characters in the text and the correpsonding ratio, not considering characters within any html tags . Thus, delete all html tags.
- delete the table if its text is composed of more than 40% numeric characters. Keep the table if it contains less than 40% numeric characters .
I defined a function that is called when the re.sub command is run. The rawtext variable contains the whole html-formatted text I want to parse. Within the function, I try to process the steps described above and return a html-stripped version of the table or a blank space, depending on the ratio of numeric characters. However, the first re.sub command within the function seems to delete not only tags, but everything, including the textual content.
def tablereplace(table):
table = re.sub('<[^>]*>', ' ', str(table))
numeric = sum(c.isdigit() for c in table)
alphabetic = sum(c.isalpha() for c in table)
try:
ratio = numeric / (numeric + alphabetic)
print('ratio = ' + ratio)
except ZeroDivisionError as err:
ratio = 1
if ratio > 0.4:
emptystring = re.sub('.*?', ' ', table, flags=re.DOTALL)
return emptystring
else:
return table
rawtext = re.sub('<table.+?<\/table>', tablereplace, rawtext, flags=re.IGNORECASE|re.DOTALL)
If you have an idea on what might be wrong with this code, I would be very happy if you share it with me. Thank you!