I'm parsing .mozeml files from Eudora and converting them into an mbox file. (The original mbox got corrupted and was deleted; the .mozeml files were left over, but I'm unable to import them.) There are over 200,000 e-mails, and I'm unsure what a good way to handle this properly would be.
I am thinking of writing a Java program that reads the .mozeml files (they are XML, UTF-8 encoded), parses the data, and then writes an mbox file in one of the formats described at http://en.wikipedia.org/wiki/Mbox#Family.
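A minimal sketch of the writing step, assuming the mboxrd variant from that page (any body line that starts with zero or more ">" followed by "From " gets one extra ">"). The class name MboxWriter, the method toMboxEntry, and its parameters are hypothetical, and the sender, date, and headers are assumed to have already been extracted from the .mozeml XML:

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper: formats one message as an mboxrd entry.
final class MboxWriter {
    // asctime-style date for the "From " separator line; "ppd" pads the day with a space.
    private static final DateTimeFormatter ASCTIME =
            DateTimeFormatter.ofPattern("EEE MMM ppd HH:mm:ss yyyy", Locale.ENGLISH);

    // mboxrd rule: lines matching ^>*From  must be quoted with one more ">".
    private static final Pattern QUOTABLE = Pattern.compile("^>*From ");

    static String toMboxEntry(String envelopeSender, ZonedDateTime date,
                              String headers, String body) {
        StringBuilder sb = new StringBuilder();
        // Separator line that starts every message in the mbox family.
        sb.append("From ").append(envelopeSender).append(' ')
          .append(ASCTIME.format(date)).append('\n');
        // Header block, then the blank line that separates headers from body.
        sb.append(headers.trim()).append("\n\n");
        for (String line : body.split("\r?\n")) {
            if (QUOTABLE.matcher(line).lookingAt()) {
                sb.append('>');
            }
            sb.append(line).append('\n');
        }
        // A blank line ends the message before the next "From " separator.
        sb.append('\n');
        return sb.toString();
    }
}
```

With 200,000 messages it would probably make sense to append each entry to a BufferedWriter opened with UTF-8 as it is produced, rather than building the whole file in memory.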
The problem is that the XML doesn't separate the To line from the message; it's all one string. I'm not entirely sure how to handle that properly.
For example, here is how a message looks:
"Joe 1" <joe1@gmail.com>joe2@gmail.comHello this is an e-mail...
or
"Joe 1" <joe1@gmail.com>"Joe 2" <joe2@gmail.com>Hello this is an e-mail...
For the first form there are a lot of cases to check, since the address could end in .com, .net, .com.hk, .co.jp, etc. The second form is a bit easier because the To line ends at the closing >. So I'm unsure how to handle the first case and how to ensure the result is accurate across all 200,000 e-mails.
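One possible approach, sketched under some loud assumptions: every string starts with exactly two addresses (the first address, then a single To address) followed by the body, and a bare address always ends in one of a known set of domain suffixes. The suffix list below is a made-up, incomplete sample; a real run would need a much fuller list. The class name FusedHeaderSplitter and the method split are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits "<first address><To address><body>" into three parts.
final class FusedHeaderSplitter {
    // Assumed sample of suffixes; longer ones come first so "com.hk" wins over "com".
    private static final String SUFFIXES = "com\\.hk|co\\.jp|co\\.uk|com|net|org";

    // Bare address: local part, "@", as few domain labels as possible, then a known suffix.
    private static final String BARE =
            "[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\\.)+?(?:" + SUFFIXES + ")";

    // Named address: "Display Name" <addr>, which ends unambiguously at ">".
    private static final String NAMED = "\"[^\"]*\"\\s*<[^>]+>";

    private static final Pattern ADDRESS =
            Pattern.compile("(?:" + NAMED + "|" + BARE + ")");

    /** Returns {firstAddress, toAddress, body}, or null if the string does not fit. */
    static String[] split(String fused) {
        Matcher m = ADDRESS.matcher(fused);
        if (!m.lookingAt()) {
            return null;
        }
        String first = m.group();
        // Continue matching right where the first address ended.
        m.region(m.end(), fused.length());
        if (!m.lookingAt()) {
            return null;
        }
        return new String[] { first, m.group(), fused.substring(m.end()) };
    }
}
```

Both examples above split into the two addresses plus "Hello this is an e-mail...". The bare form stays inherently ambiguous (a body that begins with lowercase letters or a dot could be part of the domain), so it may be worth logging every message where split returns null, or where the body starts with a lowercase letter or ".", and reviewing those by hand instead of trusting the split.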