
I am trying to use BeautifulSoup to parse an HTML document. I have tried to write code that parses the document, finds all tables, and removes those in which numeric characters make up more than 15% of all alphanumeric characters. I have used code given as an answer to this previous question:

Delete HTML element if it contains a certain amount of numeric characters

but for some reason the table.decompose() call raises an error. I'd appreciate any help with this. Please note that I am a beginner, so although I do try, I don't always understand the more complicated solutions!

Here is the code:

import re

from bs4 import BeautifulSoup

test_file = 'locationoftestfile.html'


# Remove tables in which numeric characters make up more than 15% of all alphanumeric characters
def remove_table(table):
        table = re.sub('<[^>]*>', ' ', str(table))
        numeric = sum(c.isdigit() for c in table)
        print('numeric: ' + str(numeric))
        alphabetic = sum(c.isalpha() for c in table)
        print('alpha: ' + str(alphabetic))
        try:
                ratio = numeric / float(numeric + alphabetic)
                print('ratio: '+ str(ratio))
        except ZeroDivisionError as err:
                ratio = 1
        if ratio > 0.15: 
            table.decompose()


# Define a function to create our Soup object and then extract text
def file_to_text(file):
    soup_file = open(file, 'r')
    soup = BeautifulSoup(soup_file, 'html.parser')
    for table in soup.find_all('table'):
        remove_table(table)
    text = soup.get_text()
    return text


file_to_text(test_file)

This is the output/error I am receiving:

numeric: 1
alpha: 55
ratio: 0.017857142857142856
numeric: 9
alpha: 88
ratio: 0.09278350515463918
numeric: 20
alpha: 84
ratio: 0.19230769230769232
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-28-c7e380df4fdc> in <module>
----> 1 file_to_text(test_file)

<ipython-input-27-9fb65cec1313> in file_to_text(file)
     16                 ratio = 1
     17         if ratio > 0.15:
---> 18             table.decompose()
     19     text = soup.get_text()
     20     return text

AttributeError: 'str' object has no attribute 'decompose'

Please note that the table.decompose() call is my own change and differs from the solution I have linked. That solution uses

   return True
else:
   return False

but, perhaps naively, I don't understand how that would remove the table.

Cornflour
  • That code looks quite _"wild"_ to me (parsing HTML with regex). Can you share the HTML, or edit the question to include a small sample input and the expected output? – Andrej Kesely Jan 08 '20 at 19:59
  • I agree with @AndrejKesely, the `table.decompose()` error might be the least of your problems. – AMC Jan 08 '20 at 20:03
  • You are probably both very very right. This code is a mash up of code provided by my lecturer (the second def) and code I took from the link. Thankfully the solution below appears to have worked and so, for now, that will do for me! – Cornflour Jan 08 '20 at 20:14

1 Answer

table = re.sub('<[^>]*>', ' ', str(table))

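This overwrites the parameter 'table' with a string. By the time table.decompose() is called, 'table' is no longer the BeautifulSoup tag but a plain str, which has no decompose() method; hence the AttributeError. You probably want to use another name for the variable here, e.g.: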

def remove_table(table):
    # Keep the tag in `table`; store the tag-stripped text under a new name
    table_as_str = re.sub('<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in table_as_str)
    print('numeric: ' + str(numeric))
    alphabetic = sum(c.isalpha() for c in table_as_str)
    print('alpha: ' + str(alphabetic))
    try:
        ratio = numeric / float(numeric + alphabetic)
        print('ratio: ' + str(ratio))
    except ZeroDivisionError:
        # No alphanumeric characters at all: treat the table as fully numeric
        ratio = 1
    if ratio > 0.15:
        # `table` is still the tag here, so decompose() is available
        table.decompose()
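
To check the fix end to end, here is a minimal, self-contained sketch. The sample HTML and the `threshold` parameter are made up for illustration and are not from the question; the logic is otherwise the same as above, with the prints dropped and the try/except replaced by an explicit zero check.

```python
import re

from bs4 import BeautifulSoup


def remove_table(table, threshold=0.15):
    """Remove `table` from its soup if digits exceed `threshold` of its alphanumerics."""
    # New name for the stripped text, so `table` keeps pointing at the tag
    table_as_str = re.sub(r'<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in table_as_str)
    alphabetic = sum(c.isalpha() for c in table_as_str)
    total = numeric + alphabetic
    # An empty table counts as fully numeric, as in the original code
    ratio = numeric / total if total else 1
    if ratio > threshold:
        table.decompose()


# Hypothetical sample input: one mostly-text table, one mostly-numeric table
html = """
<p>Intro text</p>
<table><tr><td>Name</td><td>Alice</td></tr></table>
<table><tr><td>2019</td><td>12345</td><td>ab</td></tr></table>
"""

soup = BeautifulSoup(html, 'html.parser')
for table in soup.find_all('table'):
    remove_table(table)

remaining = soup.find_all('table')  # only the text table should be left
text = soup.get_text()
```

Looping over `soup.find_all('table')` while decomposing is fine here because `find_all` returns a list that was built before the loop started. Nested tables would need extra care, which this sketch does not attempt.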
Fredrik Håård
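
On the `return True` / `return False` variant mentioned in the question: a function written that way does not remove anything itself. It only answers "should this table go?", and the calling code is presumably expected to act on the answer, for example with `if is_mostly_numeric(table.get_text()): table.decompose()`. A rough sketch of such a check follows, using only the standard library. The name `is_mostly_numeric` and the `threshold` parameter are hypothetical, not taken from the linked answer, and the sample strings just reproduce the character counts printed in the question.

```python
def is_mostly_numeric(text, threshold=0.15):
    """Return True if digits make up more than `threshold` of the alphanumerics in `text`."""
    numeric = sum(c.isdigit() for c in text)
    alphabetic = sum(c.isalpha() for c in text)
    total = numeric + alphabetic
    if total == 0:
        # No alphanumeric characters: treat as fully numeric, like ratio = 1 in the question
        return True
    return numeric / total > threshold


# Synthetic strings with the same counts as the question's output
second_table = '1' * 9 + 'a' * 88   # 9 digits, 88 letters
third_table = '1' * 20 + 'a' * 84   # 20 digits, 84 letters

keep_second = not is_mostly_numeric(second_table)
drop_third = is_mostly_numeric(third_table)
```

Separating the decision from the removal also makes the check easy to try on plain strings, with no HTML involved.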