I get transaction emails from my bank everytime I make a transaction. It comes in html. I want to be able to get certain information like confirmation_number, date, amount, etc. from the html content.
I tried to use regex extraction and also BeautifulSoup but the results are ugly and unwieldy. For example, the html code doesn't come with any useful attributes so it's not easy to do a find() with attributes filter. See snippet of html code below:
<table style="border: 1px solid black; border-collapse: collapse">
<tbody>
<tr>
<td colspan="2" style="border:1px solid black;padding:3px">
<center>
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
<b>
Transfer Money Details
</b>
</font>
</center>
</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Confirmation Number
</font>
</td>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
1594379907846
</font>
</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Transaction Date and Time
</font>
</td>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Friday, Jul 10 2020; 07:18:54 PM (GMT +8)
</font>
</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Transfer From
</font>
</td>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
XXXX-XXX-247 (PESO SAVINGS)
</font>
</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Transfer To
</font>
</td>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
XXXX-XXX-545
</font>
</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Amount
</font>
</td>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
PHP 1,200.00
</font>
</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Service Fee
</font>
</td>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
PHP 0.00
</font>
</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Total Amount
</font>
</td>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
PHP 1,200.00
</font>
</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Notes
</font>
</td>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Mask filters
</font>
</td>
</tr>
</tbody>
</table>
I want to be able to have a dataframe or a dictionary that looks like this:
{
'Confirmation Number': '1594379907846',
'Transaction Date and Time': 'Friday, Jul 10 2020; 07:18:54 PM (GMT +8)',
'Transfer From': 'XXXX-XXX-247 (PESO SAVINGS)'
... and so on
}
The code I have:
def get_content(html_content):
soup = BeautifulSoup(html_content, 'html.parser')
rows = soup.find_all('tr')
content_ls = []
trans_details = {}
for row in rows:
cells = row.findChildren('td')
for cell in cells:
content_ls.append(cell.getText())
trans_details['Confirmation Number'] = content_ls[2]
trans_details['Date_Time'] = content_ls[4]
trans_details['From'] = content_ls[6]
trans_details['To'] = content_ls[8]
trans_details['Amount'] = content_ls[10]
trans_details['Notes'] = content_ls[12]
return trans_details
produces this dictionary:
{'Amount': 'PHP 1,200.00',
'Confirmation Number': '1594379907846',
'Date_Time': 'Friday, Jul 10 2020; 07:18:54 PM (GMT +8)',
'From': 'XXXX-XXX-247 (PESO SAVINGS)',
'Notes': 'PHP 0.00',
'To': 'XXXX-XXX-545'}
Is there a more elegant and pythonic way of doing it?
Ultimately, I'd like to produce a DataFrame, with columns 'Confirmation Number', 'Transaction Date and Time', and so on.
Thanks