Extracting email addresses, phone numbers using Stanford CoreNLP

Question

I have been looking for a way to extract email addresses, phone numbers, ... from text using Stanford CoreNLP (RegexNERAnnotator). Can anyone please provide an example?

UPDATE (04/11/2015): Actually, I should have asked instead whether Stanford RegexNERAnnotator supports Java regular expressions.

Example Usage:

       final String EMAIL_PATTERN = 
            "^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@"
            + "[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$";

       List<CoreLabel> tokens = ...;
       TokenSequencePattern pattern = TokenSequencePattern.compile(EMAIL_PATTERN);
       TokenSequenceMatcher matcher = pattern.getMatcher(tokens);

       while (matcher.find()) {
         String matchedString = matcher.group();
         List<CoreMap> matchedTokens = matcher.groupNodes();
         ...
       }

It seems that it doesn't support Java regular expressions:

Exception in thread "main" edu.stanford.nlp.ling.tokensregex.parser.TokenMgrError: Lexical error at line 1, column 1.  Encountered: "^" (94), after : ""
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParserTokenManager.getNextToken(TokenSequenceParserTokenManager.java:1029)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.jj_ntk(TokenSequenceParser.java:3228)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.SeqRegexBasic(TokenSequenceParser.java:784)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.SeqRegexDisjConj(TokenSequenceParser.java:973)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.SeqRegex(TokenSequenceParser.java:743)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.SeqRegexWithAction(TokenSequenceParser.java:1596)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.parseSequenceWithAction(TokenSequenceParser.java:37)
    at edu.stanford.nlp.ling.tokensregex.TokenSequencePattern.compile(TokenSequencePattern.java:186)
    at edu.stanford.nlp.ling.tokensregex.TokenSequencePattern.compile(TokenSequencePattern.java:169)

score 5 · Accepted Answer · edited May 23 '17 at 12:15

StackOverflow is not a place for tutorials, or even examples. That said, a plain regular expression should work here, without needing RegexNER at all. For emails, a bit of Googling turns up the question "Using a regular expression to validate an email address". Phone numbers should be as easy as the following long but straightforward regex:

(\+[0-9]{1,2}(\s*|-)?)?(\(?[0-9]{3}\)?)?(\s*|-)[0-9]{3}(\s*|-)[0-9]{4}

My guess is that the tokenization from the Stanford Tokenizer would make this harder and not easier.
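
As a rough illustration of the plain-regex route, here is a minimal sketch using only java.util.regex on the raw text, with no CoreNLP involved. The class name ContactExtractor and its helper methods are hypothetical names chosen for this example. The email pattern is a simplified one assumed for illustration (unanchored, and not a full implementation of the email spec). The phone pattern is the one shown above, escaped for a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and methods are illustrative only.
class ContactExtractor {

    // Simplified email pattern (an assumption, not a complete implementation
    // of the email address spec). It has no ^ or $ anchors, so find() can
    // locate addresses anywhere inside a longer text.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // The phone pattern from the answer, with backslashes doubled for Java.
    private static final Pattern PHONE = Pattern.compile(
            "(\\+[0-9]{1,2}(\\s*|-)?)?(\\(?[0-9]{3}\\)?)?(\\s*|-)[0-9]{3}(\\s*|-)[0-9]{4}");

    // Collects every non-overlapping match of the pattern in the text.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            // The phone pattern may begin with whitespace, so trim each match.
            matches.add(matcher.group().trim());
        }
        return matches;
    }

    static List<String> extractEmails(String text) {
        return findAll(EMAIL, text);
    }

    static List<String> extractPhones(String text) {
        return findAll(PHONE, text);
    }

    public static void main(String[] args) {
        String text = "Call +1 650-555-1234 or write to jane.doe@example.com.";
        System.out.println(extractEmails(text));
        System.out.println(extractPhones(text));
    }
}
```

Running the patterns over the untokenized string sidesteps the tokenizer entirely. Both patterns are deliberately loose: the phone pattern can also match a run of digits inside a longer number, so results may need filtering depending on the input.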

answered Nov 03 '15 at 19:50

Gabor Angeli

The answer you linked says that you can't use regular expressions for email. – Reactormonk Nov 03 '15 at 19:52

Indeed -- but that also means you likely don't want to use TokensRegex for those cases. If you absolutely need to capture every valid email address, you're stuck implementing the complete spec. Otherwise, a regexp will likely catch 99.9% of the cases you see. – Gabor Angeli Nov 04 '15 at 08:26
Thank you. So it seems that, in my case, using Java regular expressions to extract emails and phone numbers will be easier than using Stanford RegexNERAnnotator. – Ahmed MANSOUR Nov 04 '15 at 11:27