I ran into a problem parsing an attachment filename that is percent-encoded (URL-style encoding) in the message header:

Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename*0*=utf-8''%48%61%72%6D%6F%6E%6F%67%72%61%6D;
 filename*1*=%32%30%31%38%20%C5%81%75%6B%61%73%7A%65%77;
 filename*2*=%61%20%33%35%2E%70%64%66

For this message, get_filename returns the filename already decoded, with non-ASCII characters in it. decode_header cannot handle such a string and raises this exception:

  File "/usr/lib/python2.7/email/header.py", line 73, in decode_header
    header = str(header)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0141' in position 26: ordinal not in range(128)

Here is the code I use to get the filename:

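from email.header import decode_header

for part in msg.walk():
    if part.get_content_maintype() == 'multipart':
        continue
    content = part.get_payload(decode=True)
    if content:
        filename = part.get_filename()
        if filename:
            # raises UnicodeEncodeError when filename is already decoded unicode
            filename = decode_header(filename)
            [...]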

All the other common encodings seem to work. The problem is that get_filename sometimes returns an already-decoded string and sometimes an encoded one. Could you advise me on how to resolve this?


The exact content of the filename when I get the UnicodeEncodeError is "Harmonogram 2018 Łukaszewa 35.pdf"

A second message contains the following header, and it works:

Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="To =?UTF-8?b?Y8OzxbwsIMW8ZQ==?= ze Szwecji, to nic,
 =?UTF-8?b?xbxl?= ze Szwecji..xlsx"
  • Do you really need to use Python 2 for this? Moving to Python 3 should definitely be in your near-term roadmap anyway, and will solve this issue at least partially out of the box. Try Python 3.6 or newer for a slightly overhauled version of the `email` library. – tripleee Jul 19 '18 at 08:59
  • Possible duplicate of [Decoding RFC 2231 headers](https://stackoverflow.com/questions/18094309/decoding-rfc-2231-headers) – tripleee Jul 19 '18 at 09:00
  • Could you show the exact content of filename when you get the UnicodeEncodeError? – Serge Ballesta Jul 19 '18 at 10:01
  • Exact content when I get UnicodeError is "Harmonogram 2018 Łukaszewa 35.pdf" – user3177697 Jul 23 '18 at 12:23

1 Answer

Try this:

from email.header import decode_header, make_header
# decode any RFC 2047 encoded-words, then join the pieces into one header object
file_name = make_header(decode_header(part.get_filename()))
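
To see how this behaves on both kinds of header from the question, here is a minimal sketch. It assumes Python 3 (as suggested in the comments), where get_filename returns text in both cases. The helper name `decoded_filename` and the two sample messages are made up for illustration, modelled on the headers above (the second filename is shortened). On Python 2.7, the traceback in the question shows decode_header calling str(header), so an already-decoded unicode filename would probably need to be skipped before this call.

```python
from email import message_from_string
from email.header import decode_header, make_header


def decoded_filename(part):
    """Return the attachment filename as text, or None if the part has none.

    Hypothetical helper, not part of the email package.
    """
    raw = part.get_filename()
    if raw is None:
        return None
    # A percent-encoded (RFC 2231) name comes back from get_filename()
    # already decoded, and decode_header() passes plain text through.
    # Encoded-words such as =?UTF-8?b?...?= (RFC 2047) are decoded here.
    return str(make_header(decode_header(raw)))


# Illustrative sample parts modelled on the headers in the question.
rfc2231_part = message_from_string(
    "Content-Type: application/pdf\n"
    "Content-Disposition: attachment;\n"
    " filename*0*=utf-8''%48%61%72%6D%6F%6E%6F%67%72%61%6D;\n"
    " filename*1*=%32%30%31%38%20%C5%81%75%6B%61%73%7A%65%77;\n"
    " filename*2*=%61%20%33%35%2E%70%64%66\n"
    "\n"
    "body\n"
)
rfc2047_part = message_from_string(
    "Content-Type: application/octet-stream\n"
    "Content-Disposition: attachment;"
    ' filename="To =?UTF-8?b?Y8OzxbwsIMW8ZQ==?= ze Szwecji.xlsx"\n'
    "\n"
    "body\n"
)

name_2231 = decoded_filename(rfc2231_part)
name_2047 = decoded_filename(rfc2047_part)
```

Both names should come back as ordinary text, so the same call can be used regardless of which encoding the sender chose.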
Hymns For Disco