
I'm working on a Python script that uses pypff to open Outlook PST files and extract useful information. I'm following the code posted here.

I'm trying to get the names of the attachments for each email, but the only methods on the 'attachment' type are get_size(), read_buffer() and seek_offset(), which aren't useful to me.

The read_buffer method returns a long byte string, something like \x00\x11\x00\x02\x01\x02\x02\x01\x03\x04\x07\x05\...

How can I decode it?

Masoud Rahimi
Elliot G

1 Answer


You can try decoding with ASCII first.

attachment = msg.get_attachment(0)
print(attachment.read_buffer(attachment.get_size()).decode('ascii', errors="ignore"))
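
To make the effect of errors="ignore" concrete, here is a small sketch on a made-up byte string. The bytes are purely illustrative and do not come from a real PST attachment; the comparison with errors="replace" is an addition of mine, not part of the approach above.

```python
# Illustrative bytes only: ASCII text surrounded by non-ASCII bytes.
raw = b"\x93\x93report.txt\xfa\x8c"

# errors="ignore" silently drops every byte that ASCII cannot represent.
text = raw.decode("ascii", errors="ignore")
print(text)

# errors="replace" keeps a visible U+FFFD marker for each undecodable byte,
# which makes it easier to see how much of the input failed to decode.
marked = raw.decode("ascii", errors="replace")
print(marked.count("\ufffd"), "bytes could not be decoded as ASCII")
```

With "ignore" you only see what survived, so a codec that discards most of the input can still look clean.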

I think Microsoft uses more than one encoding for different parts of an attachment, so no single codec can decode everything perfectly. If ASCII does not recover enough content, you can try them all. The set of available codecs differs between Python versions; check it out here.

# 98 encodings in python3.5/6/7
decode = ['ascii','big5','big5hkscs','cp037','cp273',
          'cp424','cp437','cp500','cp720','cp737',
          'cp775','cp850','cp852','cp855','cp856',
          'cp857','cp858','cp860','cp861','cp862',
          'cp863','cp864','cp865','cp866','cp869',
          'cp874','cp875','cp932','cp949','cp950',
          'cp1006','cp1026','cp1125','cp1140','cp1250',
          'cp1251','cp1252','cp1253','cp1254','cp1255',
          'cp1256','cp1257','cp1258','cp65001','euc_jp',
          'euc_jis_2004','euc_jisx0213','euc_kr','gb2312','gbk',
          'gb18030','hz','iso2022_jp','iso2022_jp_1','iso2022_jp_2',
          'iso2022_jp_2004','iso2022_jp_3','iso2022_jp_ext','iso2022_kr','latin_1',
          'iso8859_2','iso8859_3','iso8859_4','iso8859_5','iso8859_6',
          'iso8859_7','iso8859_8','iso8859_9','iso8859_10','iso8859_11',
          'iso8859_13','iso8859_14','iso8859_15','iso8859_16','johab',
          'koi8_r','koi8_t','koi8_u','kz1048','mac_cyrillic',
          'mac_greek','mac_iceland','mac_latin2','mac_roman','mac_turkish',
          'ptcp154','shift_jis','shift_jis_2004','shift_jisx0213','utf_32',
          'utf_32_be','utf_32_le','utf_16','utf_16_be','utf_16_le',
          'utf_7','utf_8','utf_8_sig']

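# Select the decoders that recover the known text
attachment = msg.get_attachment(0)
attach_size = attachment.get_size()
raw = attachment.read_buffer(attach_size)  # read the bytes once, not once per codec

items = []
for item in decode:
    try:
        content = raw.decode(item, errors="ignore")
    except LookupError:
        # Codec not available in this Python build (e.g. cp65001 outside Windows)
        continue

    # I know 'sample_content' is in the attachment, so it's easy to see which ones can decode it.
    if 'sample_content' in content:
        items.append(item)

print(items)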

If you don't know what's in the content, you can try workarounds. For instance, in the loop you can pick the decoding that leaves the fewest "\x" escapes, since before decoding your content looks like this: "\x93\x93\xfa\x8c\xd3\x1a\xc6".
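
One way to sketch that workaround is to score each decoding by how much of its output is printable and rank the codecs by that score. The helper names printable_ratio and rank_codecs are hypothetical, the sample bytes are made up, and I use errors="replace" instead of "ignore" so that failed bytes count against a codec rather than vanishing. Treat it as a heuristic, not a reliable detector.

```python
def printable_ratio(text):
    """Fraction of characters that are printable or common whitespace.

    U+FFFD (the replacement marker) is counted as non-printable so that
    bytes a codec could not decode lower its score.
    """
    if not text:
        return 0.0
    ok = sum(
        1 for ch in text
        if ch != "\ufffd" and (ch.isprintable() or ch in "\r\n\t")
    )
    return ok / len(text)


def rank_codecs(raw, names):
    """Return codec names sorted from most to least printable output."""
    scores = []
    for name in names:
        try:
            text = raw.decode(name, errors="replace")
        except LookupError:
            continue  # codec not available in this interpreter
        scores.append((printable_ratio(text), name))
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scores]


# Made-up sample: UTF-8 text with two non-ASCII letters.
raw = "héllo wörld".encode("utf_8")
print(rank_codecs(raw, ["ascii", "utf_8"]))
```

In the loop above you would pass the attachment bytes and the full list of codec names, then inspect the top few candidates by eye.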

If anyone has a better way of decoding attachments, please leave a comment here. Thank you.

Community
KuangHao