4

I am trying to read through an outlook folder and get the ReceivedTime,CC,Subject,HTMLBody but extract the table into columns. I can pull 1) ReceivedTime,CC,Subject,HTMLBody into a dataframe and I can do 2) Extract the HTMLBody tables into a dataframe but am getting stuck on doing both 1) & 2) together.

Current code:

import win32com.client
import pandas as pd
from bs4 import BeautifulSoup


outlook = win32com.client.Dispatch("Outlook.Application")
mapi = outlook.GetNamespace("MAPI")

inbox = mapi.Folders[User@email.com'].Folders['Inbox'].Folders['Subfolder Name']
Mail_Messages = inbox.Items

for mail in Mail_Messages:
     receivedtime = mail.ReceivedTime.strftime('%Y-%m-%d %H:%M:%S')
     cc = mail.CC
     body = mail.HTMLBody
     html_body = BeautifulSoup(body,"lxml")
     html_tables = html_body.find_all('table')[0]

df = pd.read_html(str(html_tables),header=None)[0]
display(df)

The current data frame displays below. But I also want the related ReceivedTime, CC, & Subject.

0 1
0 Report Name Report.pdf
1 Team Name Team A
2 Project Name Project A
3 Unique ID Number 123456789
4 Due Date 1/1/2021

But would like column [0] to be the row headers instead. So that when each email is read it would produce a dataframe that looks like this, for all the emails in the inbox subfolder:

0 Report Name Team Name Project Name Unique ID Number Due Date ReceivedTime CC Subject
1 Report.pdf Team A Project A 123456789 1/5/2021 1/1/2021 4:38:44 AM User1@email.com, User2@email.com Action Required:Report A Coming due
2
3
4

But am getting stuck, still a begginer pythoner but all the other posts I've seen aren't quite getting me to what I'm trying to do. I appreciate any and all help with this.

KLC2021
  • 47
  • 1
  • 5
  • So you just want to transpose one dataframe (with html_body contents) and then join another dataframe (with CC, Received time) to it? Then this may be helpful to you: https://stackoverflow.com/questions/42381639/how-do-i-transpose-dataframe-in-pandas-without-index and this: https://stackoverflow.com/questions/33088010/pandas-column-bind-cbind-two-data-frames – Ewelina Luczak Apr 26 '21 at 20:35
  • anilewe, thank you for those, I tried them but still not quite what I'm trying to get at. I want to extract the the recievedtime, cc, subject and body from email. Within the body of the email there is a standard table that i would like to split the table data into 5 additional columns(to look like the 2nd table above). So maybe I am going about it the wrong way. I'm having trouble looping through the HTMLBody for all emails and splitting the html table into the additional columns still tied to the receivedtime, cc and subject. – KLC2021 Apr 27 '21 at 18:17
  • I misread your question, sorry. Check my answer, is this what you mean? It may be not the most efficient way, but at least working one. – Ewelina Luczak Apr 28 '21 at 21:46

1 Answers1

1

Try this:

import win32com.client
import pandas as pd
from bs4 import BeautifulSoup
from pprint import pprint

outlook = win32com.client.Dispatch("Outlook.Application")
mapi = outlook.GetNamespace("MAPI")

inbox = mapi.Folders['User@email.com'].Folders['Inbox'].Folders['Subfolder Name']
Mail_Messages = inbox.Items

# a list where contents of each e-mail - CC, receiv.time and subject will have been put
contents = []
column_names = ['Report Name', 'Team Name', 'Project Name', 'Unique ID Number', 'Due Date', 'ReceivedTime', 'CC', 'Subject']

for mail in Mail_Messages:

    body = mail.HTMLBody
    html_body = BeautifulSoup(body, "lxml")
    html_tables = html_body.find_all('table')

    # uncomment following lines if you want to have column names defined programatically rather than hardcoded
    # column_names = pd.read_html(str(html_tables), header=None)[0][0]
    # column_names = column_names.tolist()
    # column_names.append("CC")
    # column_names.append("Received Time")
    # column_names.append("Subject")

    # a list containing a single e-mail data - html table, CC, receivedTime and subject
    row = pd.read_html(str(html_tables), header=None)[0][1]
    row = row.tolist()
    row.append(mail.CC)
    row.append(mail.ReceivedTime.strftime('%Y-%m-%d %H:%M:%S'))
    row.append(mail.Subject)

    # appending each full row to a list
    contents.append(row)


# and finally converting a list into dataframe
df = pd.DataFrame(contents, columns=column_names)

pprint(df)
Ewelina Luczak
  • 379
  • 4
  • 13
  • anilewe thank you so much! That is exactly what I was trying to do and your code worked greatly! You saved me what was potentially another week of my frustration, can't thank you enough! – KLC2021 Apr 29 '21 at 02:04