I am trying to extract the content (as a string/text) of a .pst file.

I tried the answers to the questions below, but none of them solved my problem.

Outlook PST File Parsing in Python

Read PST files from win32 or pypff

Export PST and OST with pypff / libpff

I am mostly focused on the libpff library (https://github.com/libyal/libpff), but it does not seem to help with extracting the text of a .pst file.

My code:

import pypff
pst = pypff.file()
pst.open("my_pst_file.pst")

The code opens the .pst file, but I do not see how to extract its content as text.

Magofoco

1 Answer

Yes, you can use pypff to extract text. I followed this link too (Export PST and OST with pypff / libpff).

pypff.file() can be confusing because the developer did not provide decent documentation for its functions and attributes. It took me a while to explore it myself.

Here is what I did recently.

import pypff

# path to your pst file
path = "my_pst_file.pst"
opst = pypff.open(path)
root = opst.get_root_folder()

# My root folder has 3 subfolders; only the 2nd one has content
# 'root.get_number_of_sub_folders()' gives the count; inspect each subfolder to see which are blank
folder = root.get_sub_folder(1)
# 2 subfolders, the 2nd one is my inbox
inbox = folder.get_sub_folder(1)

# mail count in current folder
count = inbox.get_number_of_sub_items()

# Example of extracting info from one email
msg = inbox.get_sub_item(0)

subject = msg.subject
content = msg.plain_text_body.decode()
sender = msg.sender_name
header = msg.transport_headers
sent_time = msg.delivery_time

if msg.number_of_attachments > 0:
    # read the first attachment (index 0)
    attachment = msg.get_attachment(0)
    attach_size = attachment.get_size()
    attachment_content = attachment.read_buffer(attach_size).decode('ascii', errors='ignore')
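
The snippet above reads a single message from a hard-coded folder. To get the text of the whole file, a recursive walk over the folder tree may be more convenient. The following is only a rough sketch under a few assumptions: it uses `get_number_of_sub_messages()` and `get_sub_message()` on folders (rather than the `get_sub_item()` call used above), it assumes UTF-8 for the plain-text body, and the helper names `iter_messages`, `message_text`, `folder_to_text` and `pst_to_text` are made up for illustration.

```python
def iter_messages(folder):
    """Yield every message in a folder and, depth-first, in its subfolders."""
    for i in range(folder.get_number_of_sub_messages()):
        yield folder.get_sub_message(i)
    for j in range(folder.get_number_of_sub_folders()):
        yield from iter_messages(folder.get_sub_folder(j))


def message_text(msg):
    """Return the plain-text body as str, or '' if the message has none."""
    body = msg.plain_text_body
    if body is None:
        return ""
    # UTF-8 is an assumption; undecodable bytes are replaced, not dropped.
    return body.decode("utf-8", errors="replace")


def folder_to_text(folder):
    """Join subject and body of every message below a folder into one string."""
    return "\n\n".join(
        f"Subject: {msg.subject}\n{message_text(msg)}"
        for msg in iter_messages(folder)
    )


def pst_to_text(path):
    """Open a .pst file and return the text of all its messages."""
    import pypff  # imported here so the helpers above work without it

    pst = pypff.open(path)
    try:
        return folder_to_text(pst.get_root_folder())
    finally:
        pst.close()
```

With this in place, `pst_to_text("my_pst_file.pst")` would return one string that can be written to a .txt file. Messages that only carry an HTML or RTF body come out with an empty body here.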

If you want to use pypff, do not install it with pip. pip only builds version 20161119, which crashed a lot for me.

Build a newer version from the source on their website instead. It ships with a setup.py, so it should be easy to build.

For attachments, the ascii decoder is not ideal. I tried all 98 decoders in Python 3, and none of them could decode every byte, which means no single method can decode all attachments. In my case, utf_16 extracts the content, which is good enough for me.
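
Since no single codec fits every attachment, one option is to try a few in order and keep the first that decodes without error. This is only a sketch: the helper name `decode_with_fallback` and the default codec list are assumptions, not something pypff provides. latin-1 is used as the last resort because it maps every byte to a character and so never raises, although the result is not meaningful for binary attachments.

```python
def decode_with_fallback(data, encodings=("utf-8", "utf-16", "cp1252")):
    """Try each codec in order; return (text, codec_name) for the first success."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 accepts any byte sequence, so this line cannot raise.
    return data.decode("latin-1"), "latin-1"
```

It could be called on the bytes returned by `read_buffer()`, for example `text, codec = decode_with_fallback(raw_bytes)`. A successful decode does not prove the codec was the right one, so the order of the list matters.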

KuangHao