1

To convert an HTML-formatted file to a plain text file with Python, I need to delete every table whose text consists of more than 40% numeric characters.

Specifically, I would like to:

  1. identify each table element in an HTML file

  2. count the numeric and alphabetic characters in the table's text and calculate the corresponding ratio, ignoring characters inside any HTML tags (that is, strip all HTML tags first)

  3. delete the table if more than 40% of its text is numeric characters, and keep it otherwise

I defined a function that re.sub calls for each match. The rawtext variable contains the whole HTML-formatted text I want to parse. Within the function, I try to carry out the steps described above and return either a tag-stripped version of the table or a blank space, depending on the ratio of numeric characters. However, the first re.sub call within the function seems to delete not only the tags but everything, including the textual content.

def tablereplace(table):
    table = re.sub('<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in table)
    alphabetic = sum(c.isalpha() for c in table)
    try:
            ratio = numeric / (numeric + alphabetic)
            print('ratio = ' + ratio)
    except ZeroDivisionError as err:
            ratio = 1
    if ratio > 0.4:
            emptystring = re.sub('.*?', ' ', table, flags=re.DOTALL)  
            return emptystring
    else:
            return table

rawtext = re.sub('<table.+?<\/table>', tablereplace, rawtext, flags=re.IGNORECASE|re.DOTALL)

If you have an idea on what might be wrong with this code, I would be very happy if you share it with me. Thank you!

fbn
  • Don't use regex to parse HTML, use instead a tool made up for this purpose. See here https://stackoverflow.com/questions/11709079/parsing-html-using-python – shadowsheep Nov 04 '18 at 17:31
  • See [this legendary answer](https://stackoverflow.com/a/1732454/189018) on how to parse HTML with regular expressions (spoiler alert: you don't, but the post is seriously epic). – digitalarbeiter Nov 04 '18 at 20:22

2 Answers

1

As I suggested in the comments, I wouldn't use regex to parse and manipulate HTML in code. Instead, you could use a Python library built for this purpose, such as BeautifulSoup.

Here is an example of how to use it:

#!/usr/bin/python
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
html = """<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
<table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th> 
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td> 
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td> 
    <td>94</td>
  </tr>
</table>
    </div>
</body>
</html>"""
parsed_html = BeautifulSoup(html, 'html.parser')
print(parsed_html.body.find('table').text)

So you could end up with code like this (just to give you an idea):

#!/usr/bin/python
import re
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup



def tablereplace(table):
    table = re.sub('<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in table)
    print('numeric: ' + str(numeric))
    alphabetic = sum(c.isalpha() for c in table)
    print('alpha: ' + str(alphabetic))
    try:
        ratio = numeric / float(numeric + alphabetic)
        print('ratio: ' + str(ratio))
    except ZeroDivisionError:
        # No letters or digits at all: treat the table as numeric.
        ratio = 1
    return ratio > 0.4

table = """<table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th> 
    <th>Age</th>
  </tr>
  <tr>
    <td>3241424134213424214321342424214321412</td>
    <td>213423423234242142134214124214214124124</td> 
    <td>213424214234242</td>
  </tr>
  <tr>
    <td>124234412342142414</td>
    <td>1423424214324214</td> 
    <td>2134242141242341241</td>
  </tr>
</table>
"""

if tablereplace(table):
    print('replace table')
    parsed_html = BeautifulSoup(table, 'html.parser')
    rawdata = parsed_html.find('table').text
    print(rawdata)

UPDATE: In any case, this one line of your code already strips all HTML tags, as you know, since you use it before counting letters and digits:

table = re.sub('<[^>]*>', ' ', str(table))

But it's not safe, because < or > could also appear inside the text of your tags, or the HTML could be broken or malformed.
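As a small illustration of that risk, here is a sketch using only the standard re module. The sample cell is made up for the example; it has a literal > inside an attribute value, which ends the regex match too early and leaks part of the attribute into the "text".

```python
import re

TAG_RE = re.compile(r'<[^>]*>')

# Hypothetical cell: the attribute value contains a literal '>'.
cell = '<td title="a > b">5</td>'

# The pattern stops at the first '>', so the rest of the attribute
# (' b"') survives as if it were visible text.
stripped = TAG_RE.sub(' ', cell)

print(repr(stripped))
print('letters counted:', sum(c.isalpha() for c in stripped))
```

The only visible text in the cell is 5, yet the stripped string still contains a letter from the attribute, which would skew the numeric ratio.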

I left it in because it works for this example. But consider using BeautifulSoup for all HTML handling.
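For reference, a minimal sketch of what a BeautifulSoup-only version might look like, assuming Python 3 and the bs4 package. The helper names numeric_ratio and strip_numeric_tables and the threshold parameter are made up for this illustration. Nested tables are counted as part of their outermost table, and a table with no letters or digits is treated as numeric, as in the code above.

```python
from bs4 import BeautifulSoup


def numeric_ratio(text):
    """Share of digits among all letters and digits in text."""
    numeric = sum(c.isdigit() for c in text)
    alphabetic = sum(c.isalpha() for c in text)
    total = numeric + alphabetic
    # Assumption: a table without letters or digits counts as numeric.
    return numeric / total if total else 1.0


def strip_numeric_tables(html, threshold=0.4):
    """Return the plain text of html without tables that are mostly digits."""
    soup = BeautifulSoup(html, 'html.parser')
    # Collect only top-level tables before modifying the tree, so a nested
    # table is never visited after its parent has been removed.
    tables = [t for t in soup.find_all('table')
              if t.find_parent('table') is None]
    for table in tables:
        if numeric_ratio(table.get_text()) > threshold:
            table.decompose()  # remove the table and everything inside it
    return soup.get_text(separator=' ')


if __name__ == '__main__':
    sample = ('<p>Intro</p>'
              '<table><tr><td>12345</td><td>ab</td></tr></table>'
              '<table><tr><td>Name</td><td>7</td></tr></table>')
    print(strip_numeric_tables(sample))
```

Here get_text() does the tag stripping and decompose() does the deletion, so no regular expression touches the markup.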

shadowsheep
0

Thank you for your replies so far!

After intensive research, I found the cause of the mysterious deletion of the whole match. It seemed that the function only considered the first 150 or so characters of the match. The reason is that re.sub passes a match object to the function, not a string, so str(table) yields the object's abbreviated representation rather than the matched text. If you specify table = table.group(0), the whole match is processed. group(0) makes the big difference here.

Below you can find my updated script, which works properly (it also includes some other minor changes):
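To see the difference for yourself, here is a small sketch using only the standard re module. The sample string and the names inspect and seen are made up for the illustration. It records what the callback receives and compares str() of that argument with group(0).

```python
import re

TABLE_RE = re.compile(r'<table.+?</table>', flags=re.IGNORECASE | re.DOTALL)

# Hypothetical input: one table with a long run of digits.
sample = '<p>intro</p><table><tr><td>' + '7' * 200 + '</td></tr></table>'

seen = {}


def inspect(match):
    # re.sub calls this function with a match object, not the matched string.
    seen['type'] = type(match).__name__
    seen['str'] = str(match)        # the object's repr, matched text abbreviated
    seen['group'] = match.group(0)  # the complete matched substring
    return match.group(0)           # put the table back unchanged


TABLE_RE.sub(inspect, sample)

print(seen['type'])
print(len(seen['str']), len(seen['group']))
print(seen['str'])
```

The repr also starts with < and ends with >, so a tag-stripping pattern such as <[^>]*> removes most of it, which would explain why nearly everything disappeared.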

import re

def tablereplace(table):
    # re.sub passes a match object; group(0) is the full matched text.
    table = table.group(0)
    table = re.sub('<[^>]*>', '\n', table)
    numeric = sum(c.isdigit() for c in table)
    alphabetic = sum(c.isalpha() for c in table)
    try:
        ratio = numeric / (numeric + alphabetic)
    except ArithmeticError:
        # No letters or digits at all: treat the table as numeric.
        ratio = 1
    if ratio > 0.4:
        return ''
    else:
        return table

rawtext = re.sub(r'<table.+?</table>', tablereplace, rawtext, flags=re.IGNORECASE|re.DOTALL)
fbn