I'm nearly a total outsider to programming, just interested in it. I work at a shipbroking company and need to match positions (which ship will be open where and when) against orders (what kind of ship will be needed where and when, and for what kind of employment). We send and receive this info (positions and orders) by email to and from our principals and co-brokers. There are thousands of such emails each day, and we do the matching by reading them manually.

I want to build an app to do the matching for us.

One important part of this app will do the information extraction from email text.

==> My question is: how do I use Python to turn unstructured info into structured data?

Sample email of an order [annotations are in brackets and are not part of the email]:

Email Subject: 20k dwt requirement, 20-30/mar, Santos-Conti

    Content: 
    Acct ABC [Account Name]
    Abt 20,000 MT Deadweight [Size of Ship Needed]
    Delivery to make Santos [Delivery Point/Range, Owners will deliver the ship to Charterers here]
    Laycan 20-30/Mar [Laycan (the time spread in which delivery can be accepted)]
    1 time charter with grains [What kind of Employment/Trade, Cargo]
    Duration about 35 days [Duration]
    Redelivery 1 safe port Continent [Redelivery Point/Range, Charterers will redeliver the ship back to Owners here.]

    Broker name/email/phone...

End Email

The same email can be written in many different ways: some write it all in one line, some use l/c instead of laycan... There are also emails for positions, with the ship's name, open port, date range, deadweight and other specs.

How can I extract the info and put it into structured data, with Python? Let's say I have put all email contents into text files. Thanks.

Timathon

1 Answer

Below is a possible approach:

Step 1: Classify the mails into categories using the subject and/or the message body.

As you stated, one category is mails about positions and the other is mails about orders. Machine learning can be used to classify them, with a set of previous mails as the training corpus. You might consider using NLTK (Natural Language Toolkit) for Python. Here is the link on text classification using NLTK.
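To make Step 1 concrete, here is a minimal sketch of a Naive Bayes text classifier written with only the standard library. It is a stand-in for illustration, not NLTK itself (NLTK ships its own `NaiveBayesClassifier`). The training mails, the `tokenize` rule and the two label names are all made up; a real training set would come from your previous mails.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, hand-made training mails; real ones would come from past traffic.
TRAINING = [
    ("20k dwt requirement, 20-30/mar, Santos-Conti acct abc laycan redelivery", "order"),
    ("acct xyz abt 30,000 mt delivery laycan duration redelivery continent", "order"),
    ("mv example star open santos 25/mar dwt 28,000 built 2005", "position"),
    ("open tonnage list mv sample wind open rotterdam 1-5/apr gear cranes", "position"),
]


def tokenize(text):
    """Lower-case words, keeping slashed abbreviations such as 'l/c' together."""
    return re.findall(r"[a-z]+(?:/[a-z]+)?", text.lower())


def train(examples):
    """Count words per label and mails per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for this text."""
    word_counts, label_counts, vocabulary = model
    total_mails = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_mails)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
```

Calling `classify(subject + " " + body, model)` on a new mail returns either `"order"` or `"position"`. With only four training mails this is a toy; the quality depends entirely on how many labelled mails you feed it.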

Step 2: Once an email is identified as an order mail, process it to fetch the details (account name, size, time spread, etc.). As you mentioned, the challenge here is that there is no fixed format for this data. To solve this problem, you might consider preparing an exhaustive list of synonyms for each label (for account, the list could be ['acct', 'a/c', 'account', 'acnt']). This should be done once, by going through a fixed volume of previous mails.
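The synonym-list idea could be sketched roughly like this. The spellings below are only a hypothetical starter set (apart from the account list given above), and the sketch assumes each field sits on its own line and starts with its label, as in the sample mail.

```python
import re

# Canonical field name -> spellings seen in mails (hypothetical starter lists).
SYNONYMS = {
    "account": ["acct", "a/c", "account", "acnt"],
    "laycan": ["laycan", "l/c", "lay/can"],
    "delivery": ["delivery", "dely"],
    "redelivery": ["redelivery", "redely"],
    "duration": ["duration"],
}

# Reverse lookup: spelling -> canonical field name.
LABEL_TO_FIELD = {
    spelling: field
    for field, spellings in SYNONYMS.items()
    for spelling in spellings
}

# Try longer spellings first so "account" is not cut short by a shorter one.
_alternatives = "|".join(
    re.escape(spelling)
    for spelling in sorted(LABEL_TO_FIELD, key=len, reverse=True)
)
# A label at the start of the line, optional separators, then the value.
LINE_PATTERN = re.compile(rf"^\s*({_alternatives})\b[\s:.-]*(.*)$", re.IGNORECASE)


def extract_fields(email_text):
    """Return {field: raw value} for every line that starts with a known label."""
    fields = {}
    for line in email_text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            label, value = match.groups()
            fields[LABEL_TO_FIELD[label.lower()]] = value.strip()
    return fields
```

Lines without a leading label, such as "Abt 20,000 MT Deadweight" or "1 time charter with grains", are skipped by this sketch and would need their own patterns. Mails written as a single line would need the text split on some other separator first.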

To make the solution more effective, you could consider implementing an option for active learning, i.e., prompting the user when a mail contains a label that is not in any list. E.g., if a mail uses "accnt", it won't be resolved, so the user should be asked which category it falls into.
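A rough sketch of that prompting loop is below. The function names and the `ask_user` callback are hypothetical; the callback is kept separate so the prompt can be a console question, a GUI dialog, or a fake in tests.

```python
def resolve_label(label, synonyms, ask_user):
    """Map a label to its canonical field, asking the user when it is unknown.

    `synonyms` is a dict of field name -> list of known spellings.
    `ask_user` is any callable taking (label, field_names) and returning
    a field name, or something else to skip the label.
    """
    key = label.lower()
    for field, spellings in synonyms.items():
        if key in spellings:
            return field
    # Unknown spelling: ask once, then remember the answer for next time.
    field = ask_user(label, sorted(synonyms))
    if field in synonyms:
        synonyms[field].append(key)
        return field
    return None


def console_prompt(label, field_names):
    """One possible ask_user: a plain question on the console."""
    return input(f"'{label}' is unknown. Which field is it? {field_names}: ").strip()
```

For example, `resolve_label("accnt", SYNONYMS, console_prompt)` would ask once and, if the user answers `account`, add "accnt" to that list so the same spelling resolves silently afterwards. Saving the updated lists to disk between runs is left out here.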

Once a label is identified, you can use basic string operations to parse the email and fetch the relevant data in a structured format.
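As an illustration of turning the raw text into typed values, here is a small sketch for two of the fields in the sample mail. The helper names are made up, the patterns only cover the spellings shown in the question ("20,000 MT", "20k dwt", "20-30/Mar"), and the year has to be supplied by the caller because the mail does not state it.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}


def parse_deadweight(text):
    """Return the deadweight in metric tons as an int, or None if not found."""
    match = re.search(r"(\d[\d,]*)\s*(k)?\s*(?:mt|dwt)\b", text, re.IGNORECASE)
    if not match:
        return None
    amount = int(match.group(1).replace(",", ""))
    # A trailing "k" means thousands, as in "20k dwt".
    return amount * 1000 if match.group(2) else amount


def parse_laycan(text, year):
    """Return (first_day, last_day) for text like '20-30/Mar', or None.

    Assumes both days fall in the same month; `year` comes from the caller.
    """
    match = re.search(r"(\d{1,2})\s*-\s*(\d{1,2})\s*/\s*([a-z]{3})",
                      text, re.IGNORECASE)
    if not match:
        return None
    month = MONTHS.get(match.group(3).lower())
    if month is None:
        return None
    first_day, last_day = int(match.group(1)), int(match.group(2))
    return date(year, month, first_day), date(year, month, last_day)
```

The results can then go into a dict or a database row per mail, which is what the matching step needs. Impossible dates such as 31/Feb make `date` raise `ValueError`, so real code would want to catch that.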

You can refer to this discussion for a better understanding.

kundan
  • Instead of trying the nearly impossible task of parsing unstructured mails, I'd separate my client base into important and non-important clients (with a list of email addresses for the important ones). A Python program could watch the inbox and send a reply to each non-important client asking them to fill in a web form with the data in structured form. (Your reply would politely ask them to do this to speed things up, and you'd say you would process the mail manually if they didn't answer.) This reply would include an ID. If the link in the mail is used, that mail would be marked as processed. – 576i Mar 24 '14 at 10:21
  • @576i I personally like your idea. But I believe the intent of the question is to understand how Python can be of help if the mails are put in text files. – kundan Mar 24 '14 at 11:30
  • @kundan, thanks a lot for the advice. Machine Learning/NLTK/Active Learning seems to be a good approach to the information extraction challenge. I will take some time to dig into it and see what I can do with it. Meanwhile, in the discussion linked in your last line, Tal Weiss brought up Pyparsing for semi-structured text. I found pyparsing easy to read and will start from there. – Timathon Mar 25 '14 at 10:11
  • @576i, thanks for the advice. It may be a little difficult to ask our counter-parties to tailor their emails to our requirements alone. But if all players used the same email format, it would save time on matching and people could focus on other parts of ship chartering. – Timathon Mar 25 '14 at 10:49