
I have a large dump of data from an Outlook email account that comes entirely in .msg files. A quick call to ubuntu's file method revealed that they are Composite Document File V2 Documents (whatever that means). I would really like to be able to read these files as plaintext. Is that possible at all?

Update: It turns out that what I wanted to do (large-scale data mining on these kinds of files) wasn't fully possible, which was a bummer. In case you face the same issue, I made a library to address it: https://github.com/Slater-Victoroff/msgReader

Documentation isn't great, but it's a pretty small library, so it should be self-explanatory.

Slater Victoroff
  • btw it's not "ubuntu's" file "method", it's a POSIX (or at least UNIX) command. – JSmyth Jan 15 '16 at 20:52
  • Basically the same question is answered in the [more appropriate] Super User community - http://superuser.com/questions/99250/opening-a-msg-file-in-ubuntu – Juan May 06 '16 at 16:05

1 Answer


I faced the same problem this morning. I didn't find any information on the file format, but it was possible to extract the required information from the files using strings and grep:

strings -e l *.msg | grep pattern

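The -e l option (that's a lowercase L) tells strings to look for 16-bit little-endian characters, i.e. UTF-16LE text, instead of the default single-byte characters.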

This will only work if you can grep the data you need from the file (i.e. all required lines contain a standard string or pattern).
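If you would rather do the same thing from a script, here is a minimal Python sketch of the idea behind the command above. It only approximates what strings -e l does (it keeps printable ASCII characters stored as UTF-16LE and ignores everything else), and the function names `utf16le_strings` and `grep_msg` are made up for this illustration, not part of any library:

```python
import re


def utf16le_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least min_len printable ASCII characters that are
    stored as UTF-16LE, i.e. each character byte is followed by a zero byte."""
    run = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    return [m.group().decode("utf-16-le") for m in run.finditer(data)]


def grep_msg(data: bytes, pattern: str) -> list[str]:
    """Keep only the extracted strings that match the regular expression,
    roughly like piping the output of strings through grep."""
    rx = re.compile(pattern)
    return [s for s in utf16le_strings(data) if rx.search(s)]
```

To use it, read a file's raw bytes (for example with pathlib.Path(path).read_bytes()) and pass them to grep_msg together with your pattern. The same caveat applies as with the shell version: you only get something useful if the data you need can be picked out by a pattern.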

Ben Mayhew
  • Ah, forgot to update. I just went ahead and built a library that parses out a text version of the email from the raw .msg file. Will link to it for all poor souls facing this problem. https://github.com/Slater-Victoroff/msgReader – Slater Victoroff Mar 14 '13 at 06:31
  • Found documentation on the file format here: http://www.openoffice.org/sc/compdocfileformat.pdf ; I haven't read through it or tried to use it, but it may be useful. – retracile Jul 02 '13 at 14:10
  • @retracile Great find! I'll totally be looking into this. – Slater Victoroff Aug 04 '13 at 15:11
  • This might get you more than you bargained for. If you e.g. export a list of email addresses that are in a group to a `.msg` file and then delete one of the addresses, `strings -e l` will still show the deleted address: the stream in the .msg file for that address is truncated, but the bytes containing the actual address are not overwritten. – Anthon Feb 03 '17 at 08:30