I am trying to use BeautifulSoup to parse an HTML document. I have tried to write code that parses the document, finds all tables, and removes those in which numeric characters make up more than 15% of the alphanumeric (numeric plus alphabetic) characters. I based it on the code given as an answer to this previous question:
Delete HTML element if it contains a certain amount of numeric characters
but for some reason the table.decompose() call raises an error. I'd appreciate any help I could get with this. Please note that I am a beginner, so, though I do try, I don't always understand the more complicated solutions!
Here is the code:
test_file = 'locationoftestfile.html'

# Define a function to remove tables which have numeric characters/ alphabetic and numeric characters > 15%
def remove_table(table):
    table = re.sub('<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in table)
    print('numeric: ' + str(numeric))
    alphabetic = sum(c.isalpha() for c in table)
    print('alpha: ' + str(alphabetic))
    try:
        ratio = numeric / float(numeric + alphabetic)
        print('ratio: '+ str(ratio))
    except ZeroDivisionError as err:
        ratio = 1
    if ratio > 0.15:
        table.decompose()

# Define a function to create our Soup object and then extract text
def file_to_text(file):
    soup_file = open(file, 'r')
    soup = BeautifulSoup(soup_file, 'html.parser')
    for table in soup.find_all('table'):
        remove_table(table)
    text = soup.get_text()
    return text

file_to_text(test_file)
This is the output/error I am receiving:
numeric: 1
alpha: 55
ratio: 0.017857142857142856
numeric: 9
alpha: 88
ratio: 0.09278350515463918
numeric: 20
alpha: 84
ratio: 0.19230769230769232
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-28-c7e380df4fdc> in <module>
----> 1 file_to_text(test_file)

<ipython-input-27-9fb65cec1313> in file_to_text(file)
     16 ratio = 1
     17 if ratio > 0.15:
---> 18 table.decompose()
     19 text = soup.get_text()
     20 return text

AttributeError: 'str' object has no attribute 'decompose'
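The error message suggests that, by the time decompose() is called, the name table refers to a plain string rather than a BeautifulSoup Tag. That matches the first line of remove_table, where the result of re.sub (always a str) is assigned back to table. A minimal sketch that reproduces this, assuming bs4 is installed; the one-table HTML snippet and the name stripped are made up for illustration:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document, not the real test file.
html = "<table><tr><td>12345 ab</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table")
print(type(table).__name__)      # the object find() returned

# re.sub works on str(table) and returns a new plain string.
stripped = re.sub('<[^>]*>', ' ', str(table))
print(type(stripped).__name__)   # the object re.sub returned

# Only the original Tag has a decompose() method.
print(hasattr(table, "decompose"), hasattr(stripped, "decompose"))
```

Keeping the stripped text under a separate name, as here, leaves table pointing at the Tag, so table.decompose() remains available.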
Please note that the table.decompose() call is my own change and differs from the solution I have linked. That solution uses
        return True
    else:
        return False
but, perhaps naively, I don't understand how that would remove the table.
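For what it's worth, a function that returns True or False does not remove anything by itself. The removal would have to happen in the calling loop, which still holds the Tag and can call decompose() on it. A minimal sketch of that pattern, assuming bs4 is installed; the function name is_mostly_numeric, the threshold parameter, and the sample HTML are hypothetical and not taken from the linked answer:

```python
import re
from bs4 import BeautifulSoup

def is_mostly_numeric(table, threshold=0.15):
    """Return True if digits / (digits + letters) in the table text exceeds threshold."""
    # Use a separate name for the stripped text so `table` stays a Tag.
    text = re.sub('<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in text)
    alphabetic = sum(c.isalpha() for c in text)
    if numeric + alphabetic == 0:
        return True  # mirrors the `ratio = 1` fallback in the code above
    return numeric / (numeric + alphabetic) > threshold

# Hypothetical sample document with one wordy table and one numeric table.
html = """
<p>Intro text</p>
<table><tr><td>Mostly words here, only 1 digit</td></tr></table>
<table><tr><td>2019 2020 2021 ab</td></tr></table>
"""
soup = BeautifulSoup(html, 'html.parser')

# The caller decides what to do with the True/False answer.
for table in soup.find_all('table'):
    if is_mostly_numeric(table):
        table.decompose()  # `table` is still a Tag here

remaining = soup.find_all('table')
text = soup.get_text()
print(len(remaining))
print(text)
```

The design choice here is to keep the test (is this table mostly numeric?) separate from the action (remove it), so the test never needs to touch the soup.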