I have to parse a multi-line string and retrieve the email addresses from a specific part of it.

I have done it with the code below:

String input = "Content-Type: application/ms-tnef; name=\"winmail.dat\"\r\n"
            + "Content-Transfer-Encoding: binary\r\n" + "From: ABC aa DDD <aaaa.b@abc.com>\r\n"
            + "To: DDDDD dd <sssss.r@abc.com>\r\n" + "CC: Rrrrr rrede <sssss.rv@abc.com>, Dsssssf V R\r\n"
            + " <dsdsdsds.vr@abc.com>, Psssss A <pssss.a@abc.com>, Logistics\r\n"
            + " <LOGISTICS@abc.com>, Gssss Bsss P <gdfddd.p@abc.com>\r\n"
            + "Subject: RE: [MyApps] (PRO-34604) PR for Additional Monitor allocation [CITS\r\n"
            + " Ticket:258849]\r\n" + "Thread-Topic: [MyApps] (PRO-34604) PR for Additional Monitor allocation\r\n"
            + " [CITS Ticket:258849]\r\n" + "Thread-Index: AQHRXMJHE6KqCFxKBEieNqGhdNy7Pp8XHc0A\r\n"
            + "Date: Mon, 1 Feb 2016 17:56:17 +0530\r\n"
            + "Message-ID: <B7F84439E634A44AB586E3FF2EA0033A29E27E47@JETWINSRVRPS01.abc.com>\r\n"
            + "References: <JA.101.1453963700000@myapps.abc.com>\r\n"
            + " <JA.101.1453963700000.978.1454311765375@myapps.abc.com>\r\n"
            + "In-Reply-To: <JIRA.450101.1453963700000.978.1454311765375@myapps.abc.com>\r\n"
            + "Accept-Language: en-US\r\n" + "Content-Language: en-US\r\n" + "X-MS-Has-Attach:\r\n"
            + "X-MS-Exchange-Organization-SCL: -1\r\n"
            + "X-MS-TNEF-Correlator: <B7F84439E634A44AB586E3FF2EA0033A29E27E47@JETWINSRVRPS01.abc.com>\r\n"
            + "MIME-Version: 1.0\r\n" + "X-MS-Exchange-Organization-AuthSource: TURWINSRVRPS01.abc.com\r\n"
            + "X-MS-Exchange-Organization-AuthAs: Internal\r\n" + "X-MS-Exchange-Organization-AuthMechanism: 04\r\n"
            + "X-Originating-IP: [1.1.1.7]";

    // Step 1: capture everything from "To:" through the last <...> before "Message-ID".
    Pattern pattern = Pattern.compile("To:(.*<([^>]*)>).*Message-ID", Pattern.DOTALL);
    Matcher matcher = pattern.matcher(input);
    while (matcher.find()) {
        // Step 2: pull each <...> value out of the captured block.
        Pattern innerPattern = Pattern.compile("<([^>]*)>");
        Matcher innerMatcher = innerPattern.matcher(matcher.group(1));
        while (innerMatcher.find()) {
            System.out.println("-->:" + innerMatcher.group(1));
        }
    }

This works. The outer pattern first captures the required part, from To: up to Message-ID, and the inner pattern then extracts the email addresses from that capture. Is there a better way to do this? Can it be done with a single pattern and matcher?

Update: This is the expected output:

-->:sssss.r@abc.com
-->:sssss.rv@abc.com
-->:dsdsdsds.vr@abc.com
-->:pssss.a@abc.com
-->:LOGISTICS@abc.com
-->:gdfddd.p@abc.com
RamValli

2 Answers

Ideally, you could have used lookarounds:

(?<=To:.*)<([^>]+)>(?=.*Message-ID)

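Unfortunately, Java doesn't support unbounded quantifiers such as .* inside a lookbehind; the lookbehind needs a finite maximum length. A workaround is to bound the repetition. Either way, the pattern has to be compiled with Pattern.DOTALL (or prefixed with (?s)) so that . can cross the line breaks between headers: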

(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID)
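
As a rough sketch of how the bounded-lookbehind workaround might be used from Java: the class name LookbehindDemo, the extract helper, and the shortened sample headers are made up for illustration and are not part of the question. The (?s) prefix is an assumption added here so that . matches line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, written only to illustrate the workaround.
class LookbehindDemo {
    // (?s) enables DOTALL so '.' can cross the CRLF line breaks.
    // {0,1000} keeps the lookbehind bounded; it must be large enough to
    // span the distance from "To:" to the last address of interest.
    static final Pattern ADDRESS = Pattern.compile(
            "(?s)(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID)");

    // Shortened, made-up sample in the same shape as the question's input.
    static final String SAMPLE =
            "From: ABC aa DDD <aaaa.b@abc.com>\r\n"
            + "To: DDDDD dd <sssss.r@abc.com>\r\n"
            + "CC: Rrrrr rrede <sssss.rv@abc.com>, Logistics\r\n"
            + " <LOGISTICS@abc.com>\r\n"
            + "Subject: RE: monitor allocation\r\n"
            + "Message-ID: <B7F8@JETWINSRVRPS01.abc.com>\r\n"
            + "In-Reply-To: <JIRA.450101@myapps.abc.com>";

    // Collects every <...> value that has "To:" behind it (within the
    // bound) and "Message-ID" somewhere ahead of it.
    static List<String> extract(String headers) {
        List<String> result = new ArrayList<>();
        Matcher m = ADDRESS.matcher(headers);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String address : extract(SAMPLE)) {
            System.out.println("-->:" + address);
        }
    }
}
```

Keep in mind that the bound is a guess: an address that sits more than 1000 characters after To: would be skipped silently, so the limit should be sized for the longest header block you expect.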
sp00m

I think you are looking for all the emails inside <...> that come after To: and before Message-ID. If so, you can use a \G-based regex to do it in one pass:

Pattern pt = Pattern.compile("(?:\\bTo:|(?!^)\\G).*?<([^>]*)>(?=.*Message-ID)", Pattern.DOTALL);
Matcher m = pt.matcher(input);
while (m.find()) {
    System.out.println(m.group(1));
}

The regex matches:

  • (?:\\bTo:|(?!^)\\G) - a leading boundary: either To: preceded by a word boundary, or the position where the previous successful match ended (backslashes are doubled here as in the Java string literal)
  • .*? - any characters (including line breaks, because of Pattern.DOTALL), as few as possible, up to the next <
  • <([^>]*)> - a substring starting with <, followed by zero or more characters other than > (Group 1), and followed by a closing >
  • (?=.*Message-ID) - a positive lookahead that makes sure there is Message-ID somewhere ahead of the current match.
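
To see the pieces listed above working together, here is a minimal self-contained sketch. The class name GAnchorDemo, the extract helper, and the shortened sample headers are invented for illustration; only the pattern itself comes from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class wrapping the \G-based pattern in a helper.
class GAnchorDemo {
    // Compiled once and reused; DOTALL lets '.' match the CRLF line breaks.
    static final Pattern ADDRESS = Pattern.compile(
            "(?:\\bTo:|(?!^)\\G).*?<([^>]*)>(?=.*Message-ID)", Pattern.DOTALL);

    // Shortened, made-up sample in the same shape as the question's input.
    static final String SAMPLE =
            "From: ABC aa DDD <aaaa.b@abc.com>\r\n"
            + "To: DDDDD dd <sssss.r@abc.com>\r\n"
            + "CC: Rrrrr rrede <sssss.rv@abc.com>, Logistics\r\n"
            + " <LOGISTICS@abc.com>\r\n"
            + "Subject: RE: monitor allocation\r\n"
            + "Message-ID: <B7F8@JETWINSRVRPS01.abc.com>\r\n"
            + "In-Reply-To: <JIRA.450101@myapps.abc.com>";

    // The first match has to start at "To:"; each later match continues
    // from where the previous one ended, until no <...> with "Message-ID"
    // ahead of it remains.
    static List<String> extract(String headers) {
        List<String> result = new ArrayList<>();
        Matcher m = ADDRESS.matcher(headers);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String address : extract(SAMPLE)) {
            System.out.println("-->:" + address);
        }
    }
}
```

With this sample, the From address is skipped because matching cannot begin before To:, and the Message-ID and In-Reply-To values are skipped because no Message-ID follows them.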
Wiktor Stribiżew
  • Along with this answer, [this](http://stackoverflow.com/a/35154460/2270563) answer is also helpful! – RamValli Feb 03 '16 at 09:23