I am working on a CRM, where I am receiving hundreds of emails for offers/requirements per day. I am building an API that will process the email and will insert entries in the CRM.

I am using imap_tools to fetch the mails in my API, but I am stuck when a mail is part of a thread/conversation. I read some articles about using the References or In-Reply-To header of the mail, but had no luck so far. I have also tried using the Message-ID, but it gave me the same email thread instead of multiple emails.

I am getting an email thread/conversation as a single email and I want to get separated emails so I can process them easily.

Here's what I have done so far:

from imap_tools import MailBox

# Log in and open the INBOX folder
with MailBox('mail.mail.com').login('abc@abc.com', 'password', 'INBOX') as mailbox:
    for msg in mailbox.fetch():
        # msg.headers maps lowercase header names to tuples of values
        From = msg.headers['from'][0]
        To = msg.headers['to'][0]
        subject = msg.headers['subject'][0]
        received_date = msg.headers['date'][0]
        raw_email = msg.text  # plain-text body of the message
        process_email(raw_email)  # processing the email
cokeman19
  • To clarify, you mean in a single Email Body you are receiving all the previous emails in the thread along with the new mail? – shoaib30 Aug 02 '21 at 11:40
  • Yes, I am receiving them all in a single email body. – usman_gulzar Aug 02 '21 at 11:41
  • https://github.com/ikvk/imap_tools#email-attributes – Vladimir Aug 02 '21 at 12:09
  • @Vladimir can you please explain the solution you have in mind? I have tried almost all of the attributes provided. – usman_gulzar Aug 02 '21 at 12:15
  • This is a hint to use the attributes, which are ready to use. You are on the right track with references or in-reply-to, but there is no magic. – Vladimir Aug 02 '21 at 12:17
  • @Vladimir if I access the email using the references or in-reply-to header, it still gives me the thread instead of a single email. – usman_gulzar Aug 02 '21 at 13:03
  • You have several partly overlapping problems. Fix one at a time. First, find out how you can use the `references` to get the text of the referenced messages, and get that working. This part is the simple part. The much harder part is to scan the text of the new message for long extracts from each of the older messages, and mark those as quotes. Good luck. – arnt Aug 02 '21 at 20:46
  • @arnt if I get the mail using references, it returns the email thread instead of a single mail. – usman_gulzar Aug 03 '21 at 07:07
  • @usman_gulzar see if this logic helps https://stackoverflow.com/a/474174/5236575 – shoaib30 Aug 03 '21 at 07:44
  • If you ask for the text of message 1234, you get what the sender put there. If the sender included text from other messages, the text from other messages is included in what you get. So if you want to get *just the text that is unique to message 1234* then you need to use `references` to find those other messages, and use comparative text processing to isolate the part you want. – arnt Aug 03 '21 at 09:13

1 Answer

The issue you are facing is not related to the References or In-Reply-To headers. Most email clients append the previous email as quoted text when you reply, so each mail in a thread carries the bodies of all previous mails as quoted text.

In most cases (and I say most because quoting conventions vary a lot from client to client), the client quotes the previous mail by prepending > to every quoted line:

new message

> old message
>> very old message

As a hacky solution, you can drop all lines that start with >.

In Python, you can split the body with splitlines() and filter:

lines = email.splitlines()
new_lines = [i for i in lines if not i.startswith('>')]

or

new_lines = list(filter(lambda i: not i.startswith('>'), lines))

You may use regular expressions or other techniques too.

The issue with this solution is obvious: if an email has a line of its own that starts with >, dropping it loses information. Hence a more complicated approach is to select the lines that start with >, compare them with the previous emails in the thread (found via the References header), and remove only those that match.
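A minimal sketch of that comparison, assuming you have already collected the plain-text bodies of the earlier messages in the thread (for example by looking up the Message-IDs listed in the References header). The names normalize, strip_known_quotes and previous_bodies are made up for illustration, and the matching is a simple line-by-line comparison, not a robust implementation:

```python
def normalize(line):
    # Remove leading quote markers and surrounding whitespace so that
    # "> old message" and "old message" compare as equal.
    return line.lstrip('> \t').strip()


def strip_known_quotes(body, previous_bodies):
    """Drop '>' lines whose text also appears in an earlier message."""
    seen = set()
    for prev in previous_bodies:
        for line in prev.splitlines():
            text = normalize(line)
            if text:
                seen.add(text)

    kept = []
    for line in body.splitlines():
        if line.lstrip().startswith('>'):
            text = normalize(line)
            # Drop quoted blank lines and quoted lines seen in earlier mails.
            if not text or text in seen:
                continue
        kept.append(line)
    return '\n'.join(kept)


body = "new message\n> old message\n>> very old message\n> 5 > 3 is new text"
previous = ["old message\n> very old message"]
result = strip_known_quotes(body, previous)
```

Here the line "> 5 > 3 is new text" survives because it does not occur in any earlier message, while the two quoted lines are removed.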

Google has their patented implementation here https://patents.google.com/patent/US7222299

Source: How to remove the quoted text from an email and only show the new text


Edit

I realized Gmail follows the > quoting and other clients may follow other methods. There's a Wikipedia article on it: https://en.wikipedia.org/wiki/Posting_style

Conceptually the approach will be similar, but each type of client will need to be handled separately.
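For clients that do not prefix quoted lines with >, one rough option is to cut the body at the first line that looks like a reply separator. The sketch below is only an illustration: the three patterns (an "On ... wrote:" line, an "Original Message" banner, and a long horizontal rule) are assumptions about what common clients emit, and newest_part is a hypothetical helper name. You would need to check real mails from your senders and adjust the list.

```python
import re

# Assumed separator styles; extend this list after inspecting real mails.
SEPARATORS = [
    re.compile(r'^On .+ wrote:\s*$'),
    re.compile(r'^-+\s*Original Message\s*-+\s*$', re.IGNORECASE),
    re.compile(r'^[_-]{10,}\s*$'),
]


def newest_part(body):
    """Return the text above the first line that matches a separator."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if any(p.match(stripped) for p in SEPARATORS):
            break
        kept.append(line)
    return '\n'.join(kept).rstrip()


gmail_style = ("Sounds good.\n\n"
               "On Mon, Aug 2, 2021 at 11:40 AM Bob <bob@example.com> wrote:\n"
               "> earlier text")
outlook_style = "Price is 10.\n-----Original Message-----\nFrom: a@example.com"
ruled_style = "Hi\n________________________________\nolder mail"
```

One known gap: some clients wrap the "On ... wrote:" line across two lines, which this line-by-line check will miss.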

shoaib30
  • I am not receiving any quoted text in the reply, it's just plain text. – usman_gulzar Aug 03 '21 at 05:49
  • Could you add a sample raw mail to your question? We can debug better that way. – shoaib30 Aug 03 '21 at 06:10
  • here is the [link](https://docs.google.com/document/d/1U6darZOxr-VW9-yfA-NQvCobAxTcgb_lpu5zf8Avg3A/edit?usp=sharing) to sample text – usman_gulzar Aug 03 '21 at 07:01
  • I could be wrong, but I think this is how the client being used is treating it. The users most likely are using custom clients that are not adhering to adding `>` before the old mail lines. You could check through multiple mails and see if the horizontal line is constant, if yes then you can remove everything after that – shoaib30 Aug 03 '21 at 07:40
  • 1
    I'm afraid you'll have mails from multiple clients and you'll need to handle multiple different cases. Or go with Google's approach of using hashes from previous mails in `references` and remove the text corresponding to it. – shoaib30 Aug 03 '21 at 07:41