I am reading email with the IMAP library in Python, which works. I read the body part and store it in a database, but sometimes the Python code raises an error while decoding the body. I identify the content type and charset of each body part, but I don't understand how to handle every content type and charset: sometimes a part is text/plain with utf-8, and in other mails the charset is ascii, ISO-8859 or windows-1252.

Please help me handle all charsets.

Below is the code I currently use to read the email body only; I can provide the rest of my code if required.

Expected result: convert the email body from any charset to UTF-8, then to HTML, so it can be shown on a portal.

if email_message.is_multipart():
    html = None
    multipart = True
    for part in email_message.walk():
        print("%s, %s" % (part.get_content_type(), part.get_content_charset()))
        charset = part.get_content_charset()
        if part.get_content_charset() is None:
            # We cannot know the character set, so return decoded "something"
            text = part.get_payload(decode=True)
            continue
        if part.get_content_type() == 'text/plain' and part.get_content_charset() == 'utf-8':
            # print('text--->1')
            text = str(part.get_payload(decode=True))
            # text = html.decode("utf-8")
            # print(part.get_payload(decode=True))
        if part.get_content_type() == 'text/plain' and part.get_content_charset() != 'utf-8':
            # print('text--->2')
            html = part.get_payload(decode=True)
            # text1 = html.decode("utf-8")
            text1 = html.decode(part.get_content_charset()).encode('utf8')
        if part.get_content_type() == 'text/html' and part.get_content_charset() != 'windows-1252':
            html = part.get_payload(decode=True)
            # text1 = html.decode("utf-8")
            text1 = html.decode(part.get_content_charset()).encode('utf8')
        if part.get_content_type() == 'text/html' and part.get_content_charset() == 'windows-1252':
            html = part.get_payload(decode=True)
            text1 = html.decode("cp1252")
        # if part.get_content_type() == 'text/html' and part.get_content_charset() == 'windows-1252':
        #    html = part.get_payload(decode=True)
        #    text1 = html.decode("latin-1")
        # if text is not None:
        # print(text.strip())
        # prin('Rahul')
        # else:
    # print("text")    #    print( html.strip())
    # print(text1.strip())
    # print("text1")
    # print(text1)
    imageCount = 0
    imageKey = ''
    json_data = {}
    filedata = {}
    mydict1 = ''
    value = ''
    params = ''
    filename = ''
    newFileName = ''
    for part in email_message.walk():
        if part.get_content_maintype() == 'multipart':
            continue
        if part.get('Content-Disposition') is None:
            continue
        if part.get_content_type() == 'message/rfc822':
            part_string = (bytes(str(part), 'utf-8'))
            # part_string = bytes(str(part.get_payload(0)),'utf-8')
            print('EML Part')
            print(part_string)
            filename = part.get_filename()
            # filename = filename.replace('\r', '').replace('\n', '')
            # print(part_string)
            # print(('attachment wala'))
        else:
            part_string = part.get_payload(decode=True)
            # print(part_string)
            # print(('attachment wala'))
            filename = part.get_filename()
            # filename = filename.replace('\r', '').replace('\n', '')
        if filename is not None:
            filepart = []
            try:
                decodefile = email.header.decode_header(filename)
                print('decodefile')
                print(decodefile)
            except HeaderParseError:
                return filename
                #
            for line1, encoding1 in decodefile:
                enc = 'utf-8'
                #        print(encoding)
                if encoding1 is not None:  # else encoding
                    print(type(line1))
                    filepart.append((line1.decode(encoding1)))
                    print('line')
                    print(line1)
                    print(filepart)
                    filename = ''.join(filepart)[:1023]
                else:
                    filename = filename
            dot_position = filename.rfind('.')
            file_prefix = filename[0: dot_position]
            file_suffix = filename[dot_position: len(filename)]
            print(filename)
            print(file_prefix)
            print(file_suffix)
            # filename = filename.decode('utf-8')
            # subject = ''
            file_prefix = file_prefix.replace('/', '_')
            now = datetime.datetime.now()
            timestamp = str(now.strftime("%Y%m%d%H%M%S%f"))
            print('timestamp--->')
            print(timestamp)
            newFileName = file_prefix + "_" + timestamp + file_suffix
            newFileName = newFileName.replace('\r', '').replace('\n', '').replace(',', '')
            filename = filename.replace('\r', '').replace('\n', '').replace(',', '')
            sv_path = os.path.join(svdir, newFileName)
            mydict = filename + '$$' + newFileName
            mydict1 = mydict1 + ',' + mydict
            # print(mydict1)
            value, params = cgi.parse_header(part.get('Content-Disposition'))
            print(value)
            if value == 'inline':
                imageCount = imageCount + 1
                print("newFileName-->" + newFileName)
                filedata[imageCount] = newFileName
                print(filedata)
                json_data = (filedata)
            # inlineImages = inlineImages + ',' + newFileName + '{{' + str(imageCount) + '}}'
            # print(json_data)
            # print('TYPE-->')
            # print(type(raw_email))
            # print(type(part.get_payload(decode=1)))
            # if type(part.get_payload(decode=1)) is None:
            #    print('message Type')
            if not os.path.isfile(sv_path):
                # print('rahul1')
                try:
                    fp = open(sv_path, 'wb')
                    fp.write(part_string)
                    fp.close()
                except TypeError:
                    pass
                    fp.close()

else:
    print("%s, %s" % (email_message.get_content_type(), email_message.get_content_charset()))
    if email_message.get_content_charset() is None:
        # We cannot know the character set, so return decoded "something"
        text = email_message.get_payload(decode=True)
        continue
    if email_message.get_content_type() == 'text/plain' and email_message.get_content_charset() == 'utf-8':
        print('text--->1')
        text = str(email_message.get_payload(decode=True))
        # text = html.decode("utf-8")
        # print(part.get_payload(decode=True))
    if email_message.get_content_type() == 'text/plain' and email_message.get_content_charset() != 'utf-8':
        print('text--->2')
        html = email_message.get_payload(decode=True)
        # text1 = html.decode("utf-8")
        text1 = html.decode(email_message.get_content_charset()).encode('utf8')
    if email_message.get_content_type() == 'text/html' and email_message.get_content_charset() != 'windows-1252':
        html = email_message.get_payload(decode=True)
        # text1 = html.decode("utf-8")
        text1 = html.decode(email_message.get_content_charset()).encode('utf8')
    if email_message.get_content_type() == 'text/html' and email_message.get_content_charset() == 'windows-1252':
        html = email_message.get_payload(decode=True)
        text1 = html.decode("cp1252")
Rahul Gour
  • why did you indent it so much? – monkey Jan 01 '20 at 16:42
  • Can you check whether it is readable now? – Rahul Gour Jan 01 '20 at 16:43
  • Yes - looks much better. – monkey Jan 01 '20 at 16:45
  • Why do you have tons of encoding branches? Your code is almost unreadable this way. I suggest starting over with a cleanup that handles correct mails using just the stdlib calls. I also suggest splitting responsibility into more functions... – jerch Jan 01 '20 at 17:38
  • Yes, that is why I am asking how to handle these multiple encodings for all formats. If there is a simple way to decode the email body, I will definitely split the code into multiple functions; right now my concern is removing the multiple if/else branches that handle all this decoding of the email body. – Rahul Gour Jan 01 '20 at 17:59
  • You might want to switch to `policy.default`; this way you can use `get_content`, which automatically applies the content encoding. If that fails, then you have a bigger problem and need heuristics to deal with stuff that's totally "off" (caused by weakly coded mail clients, thus not unlikely to happen) – jerch Jan 01 '20 at 18:40
  • Maybe parts of my answer here help you to clean up stuff https://stackoverflow.com/a/59545921/12548337. After that, we can get back to heuristics, if the issues persist. – jerch Jan 01 '20 at 18:56
  • Try to use high level lib: https://pypi.org/project/imap-tools/ – Vladimir Sep 22 '20 at 08:47

1 Answer

How to handle all charset and content type when reading email from IMAP lib in Python

Simple answer:
Walk all message parts and apply the encoding each part declares. I see that you already do this, though I would rewrite your if-else cascades into something much simpler, since the stdlib implementation can deal with it just fine; your code is currently kind of messed up. That will work with standards-conformant mail content. But as always, there are many broken mail clients out there that don't care much about standards conformance, from good clients that break under certain circumstances to weakly scripted spam clients.
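As a rough illustration of the standards-conformant path, here is a minimal sketch that parses the message with `policy.default` (as suggested in the comments on the question) and lets `get_content()` apply the declared charset. The function name `extract_text_parts` is made up for this example, and the fallback to raw bytes on an unknown charset label is an assumption about how you might want to handle that case, not the only option.

```python
import email
from email import policy


def extract_text_parts(raw_bytes):
    """Return a list of (content_type, content) for every non-attachment text/* part."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    results = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and non-text parts
        if part.get_content_disposition() == "attachment":
            continue  # text attachments are not part of the body
        try:
            # get_content() undoes the transfer encoding and applies the declared charset
            content = part.get_content()
        except (LookupError, UnicodeError):
            # Unknown or unusable charset label: keep the raw bytes instead (assumed fallback)
            content = part.get_payload(decode=True)
        results.append((part.get_content_type(), content))
    return results
```

With this shape there is no need to branch on each charset name; `raw_bytes` would be the bytes returned by the IMAP fetch for one message.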

Long answer:
It's impossible to get this right for all messages. Decoding will fail for various reasons. Whenever decoding fails for a part, the question is what to do about it. You basically have these options:

  1. Do nothing special, just go with the raw content
    You could just insert the raw byte content into your DB and give users that content. That's not very user friendly, and probably not what you want if you have a big user base with business constraints coupled to it. It's still the much easier way to handle broken content. It's also the fallback if option 2 still fails.

  2. Try to decode content with some heuristics
    Here the nasty coding starts. Whenever decoding of a part fails, the declared encoding does not match the actual content. So what can you do here? Not much besides inspecting the content and trying to find hints for the actual encoding (like pattern matching for UTF-8 bit masks), or even brute-force decoding. Clever heuristics might try frequently seen encoding errors first (e.g. test for UTF-8 or 8-bit encodings like latin-1 early). A good rule of thumb does not exist here, as messed-up text encodings can range from just a wrongly announced encoding type up to several 8-bit encodings mixed together. While the former can most likely be spotted, the latter can never be resolved even by the most advanced heuristics and should always fall back to option 1.

  3. Skip content
    Not recommended, as it is likely to withhold important data from the user. Do this only if you're sure that the content is rubbish.
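A minimal sketch of how options 1 and 2 could fit together: try the declared charset first, then a short list of commonly seen encodings, and hand back the raw bytes if nothing decodes cleanly. The name `decode_with_fallback` and the fallback order (`utf-8`, then `cp1252`) are assumptions for illustration; a real list should come from the failures you actually observe.

```python
def decode_with_fallback(payload, declared_charset=None,
                         fallbacks=("utf-8", "cp1252")):
    """Return (text, charset_used); on total failure return (payload, None)."""
    candidates = []
    if declared_charset:
        candidates.append(declared_charset)
    # Append the heuristic candidates, skipping exact duplicates
    candidates.extend(c for c in fallbacks if c not in candidates)

    for charset in candidates:
        try:
            return payload.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset label: try the next candidate
            continue
    # Option 1: nothing worked, keep the raw bytes
    return payload, None
```

Strict decoding is used on purpose so that a wrong guess fails and the next candidate gets a chance. `latin-1` is left out of the default list because it accepts every byte sequence and would therefore hide the raw-bytes fallback.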

If you want to go the heuristics approach I suggest to do the following:

  • start with standards-conformant handling; any message that follows the standard should be handled correctly (in a perfect world you are done here)
  • implement option 1 above as a general fallback
  • collect data about typical failures, either from your own users or by searching the internet for typical faults (other mail clients have already identified those and handle them in a certain way)
  • implement the heuristics in option 2 following the 80/20 rule (first implement what most users would benefit from); everything else gets handled by option 1
  • improve the heuristics over time
  • in any case, try to avoid option 3
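For the "collect data about typical failures" step, one possible sketch is to count which declared charsets fail and how, so the most frequent cases can be handled first. The names `failure_stats` and `record_decode_attempt` are hypothetical, and treating a missing charset as `ascii` is an assumption made for this example.

```python
from collections import Counter

# Maps (declared_charset, error_name) to the number of times it was seen
failure_stats = Counter()


def record_decode_attempt(declared_charset, payload):
    """Try the declared charset; count the failure kind if it does not work."""
    try:
        payload.decode(declared_charset or "ascii")
        return True
    except (UnicodeDecodeError, LookupError) as exc:
        failure_stats[(declared_charset, type(exc).__name__)] += 1
        return False
```

Calling `failure_stats.most_common(10)` after processing a mailbox gives a rough ranking of what the heuristics should cover first.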

This is a very general answer to your question; if you have a particular issue, you should describe it in more detail.

jerch