13

I've read through this question on SO, but it doesn't quite help me. I want to import a Gmail-generated mbox file into another webmail service, but that service only accepts files of up to 40 MB per import.

So I somehow have to split the mbox file into files of at most 40 MB each and import them one after another. How would you do this?

My initial thought was to use the other script (formail) to save each mail as a single file and afterwards run a script to combine them into files of about 40 MB, but I still wouldn't know how to do this using the terminal.

I also looked at the split command, but I'm afraid it would cut off mails. Thanks for any help!

Community
Alex

4 Answers

18

I just improved the script from Mark Setchell's answer. As we can see, that script splits the mbox file by the number of emails per chunk. This improved script splits the mbox file by a defined maximum size for each chunk.
So, if you have a size limit when uploading or importing the mbox file, you can try the script below to split the mbox file into chunks of a specified size (see the notes below).
Save the script below to a text file, e.g. mboxsplit.txt, in the directory that contains the mbox file (e.g. named mbox):

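BEGIN { chunk = 0; filesize = 0 }
/^From / {
    # A new message starts: begin a new chunk if the current one reached the limit
    if (filesize >= 40000000) {   # chunk size limit, in bytes
        close("chunk_" chunk ".txt")
        filesize = 0
        chunk++
    }
}
{ filesize += length($0) + 1 }    # +1 for the newline that length() does not count
{ print > ("chunk_" chunk ".txt") }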

Then run this command in that directory (the one containing mboxsplit.txt and the mbox file):

  awk -f mboxsplit.txt mbox

Please note:

  • A chunk may end up larger than the defined size, because the size is only checked when a new email starts: the last email added to a chunk can push it over the limit.
  • It will not split an email across two chunks.
  • A chunk may contain only one email if that email is larger than the specified chunk size.

I suggest setting the chunk size lower than the maximum upload/import size.
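
The first note above can be avoided by deciding where a message goes before writing it. The following is a minimal sketch, not part of the original script: it buffers one message at a time and starts a new chunk when that message would push the current chunk over the limit. The function name `split_mbox_by_size`, its two arguments, and the zero-padded `chunk_000.txt` naming are assumptions made for illustration. A single email larger than the limit still ends up alone in an oversized chunk.

```shell
# Hypothetical helper: split_mbox_by_size FILE MAX_BYTES
# Writes chunk_000.txt, chunk_001.txt, ... into the current directory.
split_mbox_by_size() {
  # LC_ALL=C makes length() count bytes rather than multibyte characters
  LC_ALL=C awk -v max="$2" '
    function flush_msg() {
      if (msg == "") return
      # Start a new chunk if this message would push the current one over max
      if (used > 0 && used + msgsize > max) {
        close(out)
        chunk++
        used = 0
      }
      out = sprintf("chunk_%03d.txt", chunk)
      printf "%s", msg > out
      used += msgsize
      msg = ""
      msgsize = 0
    }
    /^From / { flush_msg() }   # a new message begins: write out the previous one
    { msg = msg $0 "\n"; msgsize += length($0) + 1 }
    END { flush_msg() }        # write out the last buffered message
  ' "$1"
}
```

It could be called as `split_mbox_by_size mbox 40000000`. Because each message is held in memory before it is written, this trades a little speed for the size guarantee.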

Community
Oki Erie Rinaldi
  • If I use this, is there a way to re-combine the split files? Thanks – Ycon Jun 26 '17 at 03:22
  • of course, you can combine them! remember that the split files are text files which can be simply combined. – Oki Erie Rinaldi Jul 04 '17 at 08:32
  • It did work perfectly with the huge mbox files generated by Google Takeout. I use them to import mail in Horde (GoDaddy email accounts). I just changed the size of the chunk mbox files from 40 MB to 100 MB. – abiyi May 18 '18 at 12:14
  • Excellent answer. So fast as well. – jangeador Feb 12 '19 at 15:25
  • an improvement to consider: use sprintf to zero-pad the file name indexes. Something like: `{print > ("chunk_" sprintf("%03d",chunk) ".txt");}` – Sojoodi Nov 21 '19 at 20:51
  • Fantastic solution, thank you! This came in handy for the 255mb import upload limit on roundcube – CodeUK May 18 '20 at 00:52
  • Second @Sojoodi 's comment. OP should modify the original script or mention this modification in the answer. – siliconpi Nov 25 '22 at 04:53
15

If your mbox is in standard format, each message will begin with From and a space:

From someone@somewhere.com

So, you could COPY YOUR MBOX TO A TEMPORARY DIRECTORY and try using awk to process it, on a message-by-message basis, only splitting at the start of any message. Let's say we went for 1,000 messages per output file:

awk 'BEGIN{chunk=0} /^From /{if(msgs==1000){close("chunk_" chunk ".txt");msgs=0;chunk++} msgs++} {print > ("chunk_" chunk ".txt")}' mbox

You will then get output files called chunk_0.txt to chunk_n.txt, each containing up to 1,000 messages.
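
To see whether the resulting files fit under an import limit, it helps to list the message count and byte size of each one. This is a small sketch using only `grep` and `wc`; the function name `report_chunks` is hypothetical.

```shell
# Hypothetical helper: report_chunks FILE...
# Prints "<file> <messages> <bytes>" for each file given.
report_chunks() {
  for f in "$@"; do
    msgs=$(grep -c '^From ' "$f")       # lines that start a new message
    bytes=$(wc -c < "$f" | tr -d ' ')   # file size in bytes
    printf '%s %s %s\n' "$f" "$msgs" "$bytes"
  done
}
```

For example, `report_chunks chunk_*.txt` prints one line per chunk, so an oversized file is easy to spot before uploading.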

If you are unfortunate enough to be on Windows (where cmd.exe does not treat single quotes as quoting characters), you will need to save the following in a file called awk.txt

BEGIN{chunk=0} /^From /{if(msgs==1000){close("chunk_" chunk ".txt");msgs=0;chunk++} msgs++} {print > ("chunk_" chunk ".txt")}

and then type

awk -f awk.txt mbox
Mark Setchell
  • how to make sure theyre about < 40 mb each? – Alex Jan 23 '15 at 15:27
  • Try with 10,000 messages and if the files are too big, remove the `chunk` files and reduce the 10,000 to 5,000 before running again. It is not scientific but I guess you don't have to do it every day, so you may need to experiment a bit. – Mark Setchell Jan 23 '15 at 15:37
  • can i run this right in the console? and mbox is the file url? – Alex Jan 23 '15 at 15:41
  • If you haven't got the `mbox` file on your local machine, you will need to download it first. You can use `curl "http://yourprovider/somepath/mbox" > mbox.local` or `FTP` or click some link that your provider gives you. – Mark Setchell Jan 23 '15 at 15:44
  • of course i have it :) i was just wondering what `mbox` is in your script – Alex Jan 23 '15 at 16:11
  • so do i make a sh file of that or what? – Alex Jan 23 '15 at 16:12
  • when i make a `.sh` file of it, i get an error: `awk: syntax error at source line 2 context is /^From / >>> {msgs++;if((msgs==1000){ <<< awk: illegal statement at source line 2 awk: illegal statement at source line 2 missing )` – Alex Jan 23 '15 at 16:17
  • I have simplified it - just copy the one line and paste it into the Terminal and press `Enter` – Mark Setchell Jan 23 '15 at 16:18
  • copying and executing the line in the terminal gives me this: `awk: illegal statement at source line 1` – Alex Jan 23 '15 at 16:39
  • It expects a file called `mbox` in the directory where you are running the command. Type `ls -l mbox` and see if you can see the `mbox` file. – Mark Setchell Jan 23 '15 at 16:54
  • You aren't on Windows are you? – Mark Setchell Jan 23 '15 at 16:54
  • LOL no im not :D on a mbp, osx 10.10 – Alex Jan 23 '15 at 16:58
  • i renamed the file to just mbox, `ls -lah` lists it, just like `ls -l mbox` but still the command throws a syntax error. is there a missing ( or something? – Alex Jan 23 '15 at 17:01
  • nope, still the same error: `awk: illegal statement at source line 1` i even started to edit your answer to make sure that i didnt copy unwanted chars. there just seems to be something wrong. i have no clue, but to me `print > "chunk_" chunk ".txt"` looks somehow weird, is this correct syntax? – Alex Jan 23 '15 at 17:04
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/69464/discussion-between-mark-setchell-and-alex). – Mark Setchell Jan 23 '15 at 17:05
  • Is there a way of re-combining these files back into a single `.mbox`? – Ycon Jun 26 '17 at 03:25
  • @Ycon Sure, just loop through all the chunks and use `cat` to append them together in a new `mbox` file. – Mark Setchell Jun 26 '17 at 05:25
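
The loop mentioned in the last comment could look like the sketch below. It walks the chunk indexes numerically, because a plain `cat chunk_*.txt` sorts names as text and would place `chunk_10.txt` before `chunk_2.txt`. The function name `join_chunks` is hypothetical, and the `chunk_N.txt` naming is assumed to match the awk command above.

```shell
# Hypothetical helper: join_chunks OUTPUT
# Appends chunk_0.txt, chunk_1.txt, ... from the current directory to OUTPUT,
# stopping at the first index that has no file.
join_chunks() {
  : > "$1"                      # create or empty the output file
  i=0
  while [ -f "chunk_$i.txt" ]; do
    cat "chunk_$i.txt" >> "$1"
    i=$((i + 1))
  done
}
```

Calling `join_chunks combined.mbox` in the directory holding the chunks rebuilds a single mailbox file; the output name is only an example.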
3

formail is well suited to this task. Take a look at formail's +skip and -total options:

Options
...
+skip
Skip the first skip messages while splitting.
-total
Output at most total messages while splitting.

Depending on the size of your mailbox and mails, you may try

formail -100 -s <google.mbox >import-01.mbox
formail +100 -100 -s <google.mbox >import-02.mbox
formail +200 -100 -s <google.mbox >import-03.mbox

etc.

The parts need not be of equal size, of course. If there's one large e-mail, you may have only formail +100 -60 -s <google.mbox >import-02.mbox, or if there are many small messages, maybe formail +100 -500 -s <google.mbox >import-02.mbox.

To find a reasonable initial number of mails per chunk, check how large the output would be (the third number that wc prints is the byte count):

formail -100 -s <google.mbox | wc
formail -500 -s <google.mbox | wc
formail -1000 -s <google.mbox | wc

You may need to experiment a bit to find numbers that suit your mailbox. On the other hand, since this seems to be a one-time task, you may not want to spend too much time on it.
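
A first guess for the number of messages per chunk can also be computed from the average message size. The sketch below does not call formail; it only counts `From ` lines and bytes with awk. The function name `estimate_msgs_per_chunk` is hypothetical, and the result is only a starting point, since a few large mails can still push a chunk over the limit.

```shell
# Hypothetical helper: estimate_msgs_per_chunk FILE MAX_BYTES
# Prints how many average-sized messages fit into MAX_BYTES.
estimate_msgs_per_chunk() {
  # LC_ALL=C makes length() count bytes rather than multibyte characters
  LC_ALL=C awk -v max="$2" '
    /^From / { msgs++ }              # each "From " line starts a message
    { bytes += length($0) + 1 }      # +1 for the newline
    END {
      if (msgs == 0) { print 0; exit }
      avg = bytes / msgs             # average message size in bytes
      print int(max / avg)
    }
  ' "$1"
}
```

For instance, `estimate_msgs_per_chunk google.mbox 40000000` gives a number to try as the -total value; choosing something noticeably lower leaves room for uneven message sizes.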

Olaf Dietsche
0

My initial thought was to use the other script (formail) to save each mail as a single file and afterwards run a script to combine them into files of about 40 MB, but I still wouldn't know how to do this using the terminal.

If I understand you correctly, you want to split the file up, then combine the pieces back into one big file before importing it. That is what split and cat are meant to do. split cuts the file according to your size specification, given in lines or in bytes, and adds a suffix to each output file to keep them in order. You then use cat to put the files back together:

$ split -b 40m -a 5 mbox mbox.  # this makes mbox.aaaaa, mbox.aaaab, etc.

Once you get the files on the other system:

$ cat mbox.* > mbox

Don't do this if each piece has to be importable on its own, though: when you import the files into the new mail system one at a time, a message must not be split between two files, and split cuts at byte offsets without regard to message boundaries.
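
If you do go the split-and-recombine route, it is worth confirming that the pieces really add up to the original. This sketch splits a file and compares the concatenated parts against it with `cmp`. The function name `split_roundtrip_ok` and the `.part.` prefix are assumptions for illustration.

```shell
# Hypothetical helper: split_roundtrip_ok FILE SIZE
# Splits FILE into FILE.part.aaaaa, FILE.part.aaaab, ... of SIZE bytes each
# and succeeds only if the parts concatenate back to an identical file.
split_roundtrip_ok() {
  split -b "$2" -a 5 "$1" "$1.part." || return 1
  # The shell expands the glob in sorted order, which matches split's suffixes
  cat "$1".part.* | cmp -s - "$1"
}
```

Something like `split_roundtrip_ok mbox 40m && echo identical` prints the message only when the round trip is byte-for-byte exact.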

David W.
  • not quite, i thought `formail` was a good idea to export each email in an own textfile and from those, create chunks that are about < 40 mb so that i can import them. because split might split the file maybe right in the middle of an email so that i would not get imported correctly – Alex Jan 23 '15 at 15:29
  • Split will split an email into two separate files. But, you made it sound like you're recombining the files before importing. If that's the case, it doesn't matter that `split` split up an individual email because `cat` would have patched them back together. With `-b`, `split` cuts at byte offsets, not at message boundaries. – David W. Jan 23 '15 at 20:33