0

I'm parsing .mozeml files from Eudora and converting them into an mbox file (the mbox got corrupted and was deleted, but the .mozeml files were left over, and I'm unable to import them directly). There are over 200,000 e-mails, and I'm unsure what a good way to handle this properly would be.

I am thinking of creating a Java program that will read the .mozeml files (they are XML, encoded as UTF-8), parse the data, and then write an mbox file in this format: http://en.wikipedia.org/wiki/Mbox#Family.
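
For reference, the writing step of such a program could look roughly like the sketch below. It assumes the mboxrd variant from the linked page (a "From " separator line, the headers, a blank line, then the body with any line matching ^>*From prefixed by ">"). The class name, method name, and the header lines in main are hypothetical, and the separator date is zero-padded for simplicity.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class MboxEntrySketch {
    // asctime-like date for the "From " separator line (day of month zero-padded here).
    private static final DateTimeFormatter FROM_DATE =
            DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss yyyy", Locale.ENGLISH);

    /** Builds one entry: separator line, header lines, blank line, quoted body, blank line. */
    static String toEntry(String sender, LocalDateTime date, String headers, String body) {
        // mboxrd quoting: add one ">" in front of every body line that matches ^>*From
        String quoted = body.replaceAll("(?m)^(>*From )", ">$1");
        return "From " + sender + " " + FROM_DATE.format(date) + "\n"
                + headers + "\n"
                + "\n"
                + quoted + "\n"
                + "\n";
    }

    public static void main(String[] args) {
        // Hypothetical values; the real ones would come from the parsed .mozeml data.
        System.out.print(toEntry(
                "joe1@gmail.com",
                LocalDateTime.of(2012, 8, 9, 19, 23, 0),
                "From: \"Joe 1\" <joe1@gmail.com>\nTo: joe2@gmail.com",
                "Hello this is an e-mail...\nFrom here on, lines like this one get quoted."));
    }
}
```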

The problem is just that the xml file didn't separate the To line and the message; it's just one entire string. I'm not entirely sure how to properly handle that.

For example here is how the message looks

    "Joe 1" <joe1@gmail.com>joe2@gmail.comHello this is an e-mail...

or

    "Joe 1" <joe1@gmail.com>"Joe 2" <joe2@gmail.com>Hello this is an e-mail...

For the first one, there are a lot of cases to check to find where the address ends (.com, .net, .com.hk, .co.jp, etc.). The second one is a bit easier because the To line ends with >. So I'm unsure how to handle the first case and make sure it's accurate across all 200,000 emails.

WakanaS

4 Answers

1

Try the ANTLR library for parsing strings.

Ewen
0

My first thought for this problem is to use a regular expression with a Scanner to find each email occurrence in a loop.

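import java.io.File;
import java.util.Scanner;
import java.util.regex.Pattern;

class EmailScanner {
    public static void main(String[] args) {
        // Put your email pattern here (no ^ or $ anchors, since we are searching inside text).
        Pattern email = Pattern.compile("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}");
        // Pass your file name as the first argument.
        try (Scanner s = new Scanner(new File(args[0]), "UTF-8")) {
            String token;
            // A horizon of 0 searches the rest of the input; findInLine would
            // return null at the first line that contains no match.
            while ((token = s.findWithinHorizon(email, 0)) != null) {
                /* Write your token where you need it. */
                System.out.println(token);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}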

Possible email patterns are easy to find. For example ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$ or ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.(?:[a-zA-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$ (see http://www.regular-expressions.info/email.html). Note that the ^ and $ anchors make these patterns match only a whole string, so drop them when searching inside longer text.
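
To see how such a pattern behaves on the question's data, here is a minimal sketch that applies the first pattern, with the ^ and $ anchors removed, to the first sample string. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnanchoredEmailSearch {
    // First pattern from above without ^ and $, so it can match inside a longer string.
    static final Pattern EMAIL =
            Pattern.compile("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}");

    /** Collects every substring of text that the pattern matches, left to right. */
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "\"Joe 1\" <joe1@gmail.com>joe2@gmail.comHello this is an e-mail...";
        // Prints [joe1@gmail.com, joe2@gmail.comHel]
        System.out.println(findAll(text));
    }
}
```

The second match shows the limitation: because {2,6} accepts any letters, it swallows the start of "Hello" when the address runs directly into the message. A generic TLD pattern like this one is therefore only reliable when something other than a letter follows the address.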

Vic
0

If you know what all the domain suffixes are, you can do this with some regex-fu:

[a-zA-Z_\.0-9]+@[a-zA-Z_\.0-9]+\.(com|edu|org|net|us|tv|...)

You can find a list of top level domain names here: http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

The full regex, I believe, should be this:

[a-zA-Z_\.0-9\-]+@[a-zA-Z_\.0-9\-]+\.(aero|asia|biz|cat|com|coop|info|int|jobs|mobi|museum|name|net|org|pro|tel|travel|xxx|edu|gov|mil|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)

Of course, I'm not sure if that's a complete list of TLDs, and I know ICANN recently started allowing custom TLDs, but this should catch the vast majority of the email addresses.
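
As a rough illustration of the TLD-list idea applied to the question's first case, the sketch below splits a bare leading address from the message text that runs into it. It uses a deliberately shortened TLD list, and the class and method names are hypothetical. Longer names are listed before their prefixes (com before co) because Java tries alternatives in order.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TldSplitSketch {
    // Shortened TLD list for illustration only; a real run would need the full list.
    private static final String TLDS = "com|net|org|edu|hk|jp|uk|co";
    private static final Pattern ADDRESS =
            Pattern.compile("[a-zA-Z0-9_.-]+@[a-zA-Z0-9_.-]+\\.(?:" + TLDS + ")");

    /** Returns {address, rest}, or null if the text does not start with a bare address. */
    static String[] splitLeadingAddress(String text) {
        Matcher m = ADDRESS.matcher(text);
        // lookingAt() must match at the start of the text but not the whole text.
        if (!m.lookingAt()) {
            return null;
        }
        return new String[] { m.group(), text.substring(m.end()) };
    }

    public static void main(String[] args) {
        String[] parts = splitLeadingAddress("joe2@gmail.comHello this is an e-mail...");
        System.out.println(parts[0]); // joe2@gmail.com
        System.out.println(parts[1]); // Hello this is an e-mail...
    }
}
```

This remains a heuristic: it cannot tell a .co address followed by a message starting with "m" from a .com address, so a sample of the 200,000 results would still need to be checked by hand.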

rmehlinger
  • There was a post about email validation (of course it can be used for searching as well): http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address but the standards-compliant regexp looks far more difficult: http://ex-parrot.com/~pdw/Mail-RFC822-Address.html. – Vic Aug 09 '12 at 19:23
  • Bloody hell, that's crazily complex. Still, I think my regex should work for the vast majority of email addresses, assuming I didn't miss any legitimate characters. – rmehlinger Aug 10 '12 at 00:31
  • Yes, it's really crazy, and the regexp syntax it uses looks quite unusual to me. It's also very big, and my feeling is that it will be very slow. Trying something like what we suggested might be more useful. – Vic Aug 10 '12 at 05:38
0

Here's a standard email regex modified for your format:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The leading ">" is the delimiter expected right before the address.
Pattern pattern = Pattern.compile(">[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}");
String text = "\"Joe 1\" <joe1@gmail.com>joe2@gmail.com Hello this is an e-mail...";
Matcher matcher = pattern.matcher(text);

while (matcher.find()) {
    // Strip the delimiter again before printing; prints joe2@gmail.com
    System.out.println(matcher.group().replaceFirst(">", ""));
}

It's not going to work if, as in your first example, the email runs directly into the message (joe2@gmail.comHello this), and it assumes your email addresses always begin with >. You can put other delimiters in there, though.
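
The delimiter does not have to be matched and then stripped afterwards. A lookbehind can require it without making it part of the match. A minimal sketch, assuming every address is directly preceded by < or > as in the question's samples (the class and method names are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DelimitedEmailSketch {
    // (?<=[<>]) requires a preceding '<' or '>' but keeps it out of the match.
    private static final Pattern DELIMITED_EMAIL =
            Pattern.compile("(?<=[<>])[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}");

    /** Returns every address that directly follows one of the delimiters. */
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DELIMITED_EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "\"Joe 1\" <joe1@gmail.com>joe2@gmail.com Hello this is an e-mail...";
        // Prints [joe1@gmail.com, joe2@gmail.com]
        System.out.println(extract(text));
    }
}
```

Like the snippet above, this still relies on a non-letter following the address, so it does not solve the run-on case by itself.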

davidfmatheson