Extracting parts of emails in text files

Question

I am trying to do some text processing on a corpus of emails.

I have a main directory containing various folders. Each folder has many .txt files, and each .txt file is basically a set of email conversations.

To show what my text files look like, I am copying a similar-looking file from the publicly available Enron email corpus. My data is of the same type, with multiple emails in one text file.

An example text file looks like this:

Message-ID: <3490571.1075846143093.JavaMail.evans@thyme>
Date: Wed, 8 Sep 1999 08:50:00 -0700 (PDT)
From: steven.kean@enron.com
To: kelly.kimberly@enron.com
Subject: Re: India And The WTO Services Negotiation
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Steven J Kean
X-To: Kelly Kimberly
X-cc: 
X-bcc: 
X-Folder: \Steven_Kean_Dec2000_1\Notes Folders\All documents
X-Origin: KEAN-S
X-FileName: skean.nsf

fyi
---------------------- Forwarded by Steven J Kean/HOU/EES on 09/08/99 03:49 
PM ---------------------------

Joe Hillings@ENRON
09/08/99 02:52 PM
To: Joe Hillings/Corp/Enron@Enron
cc: Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Terence H 
Thorn/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Ashok 
Mehta/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, John 
Ambler/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Steven J Kean/HOU/EES@EES, 
Jeffrey Sherrick/Corp/Enron@Enron 
Subject: Re: India And The WTO Services Negotiation  

Sanjay: Some information of possible interest to you. I attended a meeting 
this afternoon of the Coalition of Service Industries, one of the lead groups 
promoting a wide range of services including energy services in the upcoming 
WTO GATTS 2000 negotiations. CSI President Bob Vastine was in Delhi last week 
and met with CII to discuss the upcoming WTO. CII apparently has a committee 
looking into the WTO. Bob says that he told them that energy services was 
among the CSI recommendations and he recalls that CII said that they too have 
an interest.

Since returning from the meeting I spoke with Kiran Pastricha and told her 
the above. She actually arranged the meeting in Delhi. She asked that I send 
her the packet of materials we distributed last week in Brussels and London. 
One of her associates is leaving for India tomorrow and will take one of 
these items to Delhi. 

Joe

Joe Hillings
09/08/99 11:57 AM
To: Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT
cc: Terence H Thorn/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Ashok 
Mehta/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, John 
Ambler/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Steven J Kean/HOU/EES@EES, 
Jeffrey Sherrick/Corp/Enron@Enron (bcc: Joe Hillings/Corp/Enron)
Subject: India And The WTO Services Negotiation

Sanjay: First some information and then a request for your advice and 
involvment.

A group of US companies and associations formed the US WTO Energy Services 
Coalition in late May and has asked the US Government to include "energy 
services" on their proposed agenda when the first meeting of the WTO GATTS 
2000 ministerial convenes late this year in Seattle. Ken Lay will be among 
the CEO speakers. These negotiations are expected to last three years and 
cover a range of subjects including agriculture, textiles, e-commerce, 
investment, etc.

This morning I visited with Sudaker Rao at the Indian Embassy to tell him 
about our coalition and to seek his advice on possible interest of the GOI. 
After all, India is a leader in data processing matters and has other 
companies including ONGC that must be interested in exporting energy 
services. In fact probably Enron and other US companies may be engaging them 
in India and possibly abroad.

Sudaker told me that the GOI has gone through various phases of opposing the 
services round to saying only agriculture to now who knows what. He agrees 
with the strategy of our US WTO Energy Services Coalition to work with 
companies and associations in asking them to contact their government to ask 
that energy services be on their list of agenda items. It would seem to me 
that India has such an interest. Sudaker and I agree that you are a key 
person to advise us and possibly to suggest to CII or others that they make 
such a pitch to the GOI Minister of Commerce.

I will ask Lora to send you the packet of materials Chris Long and I 
distributed in Brussels and London last week. I gave these materials to 
Sudaker today.

Everyone tells us that we need some developing countries with an interest in 
this issue. They may not know what we are doing and that they are likely to 
have an opportunity if energy services are ultimately negotiated.

Please review and advise us how we should proceed. We do need to get 
something done in October.
Joe

PS Terry Thorn is moderating a panel on energy services at the upcoming World 
Services Congress in Atlanta. The Congress will cover many services issues. I 
have noted in their materials that Mr. Alliwalia is among the speakers but 
not on energy services. They expect people from all over the world to 
participate.

So, as you can see, there can be multiple emails in one text file, with no clear separation rule except the new email headers (To, From, etc.).

I can use os.walk on the main directory to go through each subdirectory and parse each text file in it.
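The traversal described above could be sketched roughly as follows. The function name `iter_text_files`, the `.txt` filter, and the choice of UTF-8 with `errors="replace"` are illustrative assumptions, not something stated in the question.

```python
import os


def iter_text_files(root_dir):
    """Yield (path, text) for every .txt file found below root_dir."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if not name.lower().endswith(".txt"):
                continue
            path = os.path.join(dirpath, name)
            # Assumption: decode as UTF-8 and replace undecodable bytes
            # rather than failing on the whole file.
            with open(path, encoding="utf-8", errors="replace") as handle:
                yield path, handle.read()
```

Each yielded text can then be handed to whatever parsing step extracts the individual emails.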

I need to extract certain parts of each email within a text file and store them as a new row in a dataset (CSV, pandas DataFrame, etc.).

The parts below would be useful to extract and store as the columns of a row. Each row of the dataset then corresponds to one email within a text file.

Fields:

Original email content | From (sender) | To (recipient) | cc (recipient) | Date/time sent | Subject of email
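As a minimal sketch of the target dataset, the fields above can be written out with the standard `csv` module. The column names in `FIELDS` and the helper `write_rows` are hypothetical choices made for illustration; a missing field (for example an email without cc) is written as an empty cell.

```python
import csv
import io

# Hypothetical column names, one per field listed above.
FIELDS = ["content", "sender", "to", "cc", "date", "subject"]


def write_rows(rows, handle):
    """Write one CSV row per email dict; missing keys become empty cells."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


rows = [
    {
        "content": "fyi",
        "sender": "steven.kean@enron.com",
        "to": "kelly.kimberly@enron.com",
        "date": "Wed, 8 Sep 1999 08:50:00 -0700 (PDT)",
        "subject": "Re: India And The WTO Services Negotiation",
    },
]

buffer = io.StringIO()  # stands in for an open file
write_rows(rows, buffer)
print(buffer.getvalue())
```

The same list of dicts can also be passed to `pandas.DataFrame(rows)` if a DataFrame is preferred.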

Edit: I looked at the linked duplicate question. It assumes a fixed spec and boundary, which is not the case here. I am looking for a simple regular-expression way of extracting the fields mentioned above.

Possible duplicate of [Parse a string of multipart data](https://stackoverflow.com/questions/45024538/parse-a-string-of-multipart-data) — stovfl, Oct 13 '18 at 08:08
Duplicate considers fixed spec. Not the case here. I am also looking for simple regular expression way of parsing the data in above format. Thnx — Baktaawar, Oct 13 '18 at 17:53

Pedro Rodrigues · Answer 1 · 2018-10-16T04:26:07.360

^Date:\ (?P<date>.+?$)
.+?
^From:\ (?P<sender>.+?$)
.+?
^To:\ (?P<to>.+?$)
.+?
^cc:\ (?P<cc>.+?$)
.+?
^Subject:\ (?P<subject>.+?$)

Make sure you're using dotall, multiline, and extended modes on your regex engine.
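In Python's `re` module those three modes correspond to `re.S` (dotall), `re.M` (multiline) and `re.X` (verbose/extended). The shortened pattern and the sample string below are made up purely to show what each flag contributes; they are not the full pattern from this answer.

```python
import re

PATTERN = r"""
    ^From:\ (?P<sender>.+?$)   # re.M: ^ and $ anchor at every line
    .+?                        # re.S: the dot may cross newlines
    ^To:\ (?P<to>.+?$)         # re.X: layout whitespace and comments are ignored
"""

sample = "Date: today\nFrom: a@example.com\nX-Note: hi\nTo: b@example.com\n"

with_dotall = re.search(PATTERN, sample, re.S | re.M | re.X)
without_dotall = re.search(PATTERN, sample, re.M | re.X)

print(with_dotall.groupdict())
print(without_dotall)  # no match: the dot cannot skip over the X-Note line
```

Note that in verbose mode a literal space has to be escaped (`\ `), which is why the pattern is written that way.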

It works for the example you posted, at least: it captures each field in a separate named group (depending on the regex engine, you may need to enable named groups as well).

Group `date`    63-99   `Wed, 8 Sep 1999 08:50:00 -0700 (PDT)`
Group `sender`  106-127 `steven.kean@enron.com`
Group `to`  132-156 `kelly.kimberly@enron.com`
Group `cc`  650-714 `Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Terence H `
Group `subject` 930-974 `Re: India And The WTO Services Negotiation  `

https://regex101.com/r/gHUOLi/1

Then use it to iterate over your stream of text. You mentioned Python, so here you go:

import re


def match_email(long_string):
    # Verbose pattern: an escaped space (\ ) is literal, other whitespace is ignored.
    regex = r'''
        ^Date:\ (?P<date>.+?$)
        .+?
        ^From:\ (?P<sender>.+?$)
        .+?
        ^To:\ (?P<to>.+?$)
        .+?
        ^cc:\ (?P<cc>.+?$)
        .+?
        ^Subject:\ (?P<subject>.+?$)
    '''
    text = long_string.strip()

    # try to match the thing; re.S (dotall) and re.M (multiline) are needed
    # in addition to re.X so that . spans newlines and ^/$ anchor per line
    match = re.search(regex, text, re.I | re.S | re.M | re.X)

    # if there is no match, it's over
    if match is None:
        return None, long_string

    # otherwise, get the named groups as a dict
    email = match.groupdict()

    # remove whatever matched from the original string
    long_string = text[match.end():]

    # return the email and the remaining string
    return email, long_string


# now iterate over the long string
emails = []
email, tail = match_email(the_long_string)
while email is not None:
    emails.append(email)
    email, tail = match_email(tail)

print(emails)

That's adapted directly from here, with just some names changed.

edited Oct 16 '18 at 04:26

answered Oct 13 '18 at 22:18


I want to extract the field corresponding "From", "To", "cc", "subject" , "time" of each email in as separate record in a dataset – Baktaawar Oct 14 '18 at 00:14
just edited the thing, does pretty much what you asked for – Pedro Rodrigues Oct 14 '18 at 07:14
In ur case it is only picking the first email. I need for all emails in the text – Baktaawar Oct 15 '18 at 21:27
iterate over, you won't be able to do it in one go with regex. It doesn't work like that. – Pedro Rodrigues Oct 15 '18 at 23:38
What do u mean iterate? U don't know when is a new email beginning and when that u can extract that part of it to run thru a regex. I think there wud be a way to get all of them in one text – Baktaawar Oct 15 '18 at 23:52
I think there should be another way to extract all at once in the form of a list of object. Like if one uses findall? it can find all matches in a list? I used re.DOTALL it does give some but not as I would want it – Baktaawar Oct 17 '18 at 21:33
Well, you think wrong. The problem is not that you can't match everything in one go; it's that regex engines will not keep track of what belongs where. This is a common pitfall with regex engines. You may want to read my answer to an analogous scenario in another question (https://stackoverflow.com/a/52800845/3343753) – Pedro Rodrigues Oct 18 '18 at 16:32
But isn't re.findall for a reason? If u give the whole text as string, it will find all instances of the pattern in the whole string as a list of pattern returned. Shouldn't that then take care of getting all patterns at one go? – Baktaawar Oct 23 '18 at 17:27
It does, but it won't keep track of which email the matched pieces belong to. I think I've already gone over this; just give it a proper second read. It's like having two lists, `From` and `cc`, and hoping they're the same size and that entries at the same index in both lists belong to the same `email`. If it so happens that one email does not have a `cc`, everything breaks. – Pedro Rodrigues Oct 23 '18 at 20:07
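The misalignment described in the last comment can be illustrated with a small sketch. The sample text and the `EMAIL` pattern are simplified assumptions (one header per line, an explicit `From:` on every email), so they would not work unchanged on the wrapped, Enron-style headers in the question. The point is only that one match per email, via `re.finditer` with an optional `cc` group, keeps the fields together, while separate `re.findall` calls produce lists of different lengths.

```python
import re

# Assumption: simplified headers, one per line, with cc optional.
EMAIL = re.compile(
    r"""
    ^From:\ (?P<sender>[^\n]+)\n
    ^To:\ (?P<to>[^\n]+)\n
    (?:^cc:\ (?P<cc>[^\n]+)\n)?      # cc is optional
    ^Subject:\ (?P<subject>[^\n]+)
    """,
    re.M | re.X,
)

sample = (
    "From: a@example.com\nTo: b@example.com\ncc: c@example.com\n"
    "Subject: first\n\nbody one\n"
    "From: d@example.com\nTo: e@example.com\n"
    "Subject: second\n\nbody two\n"
)

# One dict per email: a missing cc shows up as None in that email's record.
records = [m.groupdict() for m in EMAIL.finditer(sample)]

# Separate findall calls: the lists no longer line up.
senders = re.findall(r"^From: (.+)$", sample, re.M)
ccs = re.findall(r"^cc: (.+)$", sample, re.M)

print(records)
print(len(senders), len(ccs))
```

Here `senders` has two entries and `ccs` only one, so there is no way to tell from the lists alone which email the single cc belongs to.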
