0

I get transaction emails from my bank everytime I make a transaction. It comes in html. I want to be able to get certain information like confirmation_number, date, amount, etc. from the html content.

I tried to use regex extraction and also BeautifulSoup but the results are ugly and unwieldy. For example, the html code doesn't come with any useful attributes so it's not easy to do a find() with attributes filter. See snippet of html code below:

<table style="border: 1px solid black; border-collapse: collapse">
    <tbody>
        <tr>
            <td colspan="2" style="border:1px solid black;padding:3px">
                <center>
                    <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                        <b>
                            Transfer Money Details
                        </b>
                    </font>
                </center>
            </td>
        </tr>
        <tr>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Confirmation Number
                </font>
            </td>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    1594379907846
                </font>
            </td>
        </tr>
        <tr>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Transaction Date and Time
                </font>
            </td>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Friday, Jul 10 2020; 07:18:54 PM (GMT +8)
                </font>
            </td>
        </tr>
        <tr>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Transfer From
                </font>
            </td>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    XXXX-XXX-247 (PESO SAVINGS)
                </font>
            </td>
        </tr>
        <tr>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Transfer To
                </font>
            </td>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    XXXX-XXX-545
                </font>
            </td>
        </tr>
        <tr>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Amount
                </font>
            </td>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    PHP 1,200.00
                </font>
            </td>
        </tr>
        <tr>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Service Fee
                </font>
            </td>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    PHP 0.00
                </font>
            </td>
        </tr>
        <tr>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Total Amount
                </font>
            </td>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    PHP 1,200.00
                </font>
            </td>
        </tr>
        <tr>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Notes
                </font>
            </td>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Mask filters
                </font>
            </td>
        </tr>
    </tbody>
</table>

I want to be able to have a dataframe or a dictionary that looks like this:

{
'Confirmation Number': '1594379907846',
'Transaction Date and Time': 'Friday, Jul 10 2020; 07:18:54 PM (GMT +8)',
'Transfer From': 'XXXX-XXX-247 (PESO SAVINGS)'
    ... and so on
}

The code I have:

def get_content(html_content):
  soup = BeautifulSoup(html_content, 'html.parser')
  rows = soup.find_all('tr')

  content_ls = []
  trans_details = {}
  for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
      content_ls.append(cell.getText())

  trans_details['Confirmation Number'] = content_ls[2]
  trans_details['Date_Time'] = content_ls[4]
  trans_details['From'] = content_ls[6]
  trans_details['To'] = content_ls[8]
  trans_details['Amount'] = content_ls[10]
  trans_details['Notes'] = content_ls[12]

  return trans_details

produces this dictionary:

{'Amount': 'PHP 1,200.00',
 'Confirmation Number': '1594379907846',
 'Date_Time': 'Friday, Jul 10 2020; 07:18:54 PM (GMT +8)',
 'From': 'XXXX-XXX-247 (PESO SAVINGS)',
 'Notes': 'PHP 0.00',
 'To': 'XXXX-XXX-545'}

Is there a more elegant and pythonic way of doing it?

Ultimately, I'd like to produce a DataFrame, with columns 'Confirmation Number', 'Transaction Date and Time', and so on.

Thanks

gio888
  • 93
  • 2
  • 9
  • Could you extend the HTML example to have all the data you'd like to have in the output? And by more elegant way you mean what? Show us your code first. – baduker Nov 07 '20 at 09:24
  • save the table elements into pandas. Then you need to have a list which will have your entities name if the column values matches to them then store those rows result. – Pygirl Nov 07 '20 at 09:26
  • 1
    I would start with `dfs = pd.read_html(html_string)` and go from there as @Pygirl said – piterbarg Nov 07 '20 at 09:32
  • That example looks very straightforward. It's a table row with a key and a value that are both easy to extract as they're the only text in the row, and each is in its own `` element. I don't see how it could get much easier than that. So you'd just find all table rows, grab all the text from the first and second `` elements, and that would give you a key/value pair dictionary. Then you just look up "Confirmation Number" in that dictionary. If other values have similar structure, you're in great shape. BS ought to be great for this. – CryptoFool Nov 07 '20 at 09:33
  • Visit https://stackoverflow.com/questions/6325216/parse-html-table-to-python-list **Try to read about etree** – yash jain Nov 07 '20 at 09:37
  • Editing the original question by including all the HTML content and sample of my ugly code as suggested by baduker. Will also be trying out Pygirl's, Steve's and TOTO's suggestions. Will update with my results. Thanks guys. Still learning. – gio888 Nov 20 '20 at 10:46

1 Answers1

2

What you can do is to use lxml lib. It allows you to use xpath to find elements. Here is a method to extract information with the HTML you had provided.

def parse(html):
    root = etree.fromstring(html)
    trs = root.xpath("//tr")
    result = dict()
    for tr in trs:
        fonts = tr.xpath(".//font")
        key = fonts[0].text.strip()
        value = fonts[1].text.strip()
        result[key] = value

    return result
Dharman
  • 30,962
  • 25
  • 85
  • 135
TOTO
  • 307
  • 1
  • 6