I would like to capture groups based on a consecutive occurrence of matched groups in any order. And when one set type is repeated without the alternative set type, the alternative set is returned as nil.

I am trying to extract names and emails based on the following regex:

For names, two consecutive capitalized words:

[A-Z][\w]+\s+[A-Z][\w]+

For emails:

\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b

Example text:

John Doe john@doe.com random text
Jane Doe random text jane@doe.com
jim@doe.com  more random text tim@doe.com Tim Doe

So far I have used non-capture groups and positive look aheads to tackle the "in-no-particular-order-or-even-present" problem but only managed to do so by segmenting by newlines. So my regex looks like this:

^(?=(?:.*([A-Z][\w]+\s+[A-Z][\w]+))?)(?=(?:.*(\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b))?).*

And the results miss items where there are multiple contacts on the same line:

[
  ["John Doe", "john@doe.com"],
  ["Jane Doe", "jane@doe.com"],
  ["Tim Doe", "tim@doe.com"],
]

When what I'm looking for is:

[
  ["John Doe", "john@doe.com"],
  ["Jane Doe", "jane@doe.com"],
  [nil, "jim@doe.com"],
  ["Tim Doe", "tim@doe.com"],
]

My skills in regex are limited and I started using regex because it seemed like the best tool for matching names and emails.

Is regex the best tool to use for this kind of problem or are there more efficient alternatives using loops if we're extracting hundreds of contacts in this manner?

dchun
  • 1
    Capturing email addresses takes a much more sophisticated pattern than that, as addresses aren't necessarily in a `name@host.domain` format. Preexisting patterns exist, so search for those rather than write your own. Scanning text for email is no guarantee that the addresses are valid either, just that they matched the pattern. If you really want valid addresses ask your user for it then send it an email asking for a response to validate it. – the Tin Man Feb 28 '20 at 00:27
  • 1
    Grabbing a person's name is impossible if the data format is random. People can have one word names, or multi-word names, they can be hyphenated, contain periods, etc. Again, the best way is to ask them what they preferred to be addressed as and go with that. It _might_ help if you explain what you're trying to do. – the Tin Man Feb 28 '20 at 00:30
  • "[How to validate an email address using a regular expression?](https://stackoverflow.com/q/201323/128421)" is a good discussion, along with the "Linked" questions on the right side of that page. And https://www.regular-expressions.info/email.html might help. – the Tin Man Feb 28 '20 at 00:40
  • 1
    To illustrate @theTinMan's point, see the wonderful article [Falsehoods Programmers Believe About Names (with examples)](https://shinesolutions.com/2018/01/08/falsehoods-programmers-believe-about-names-with-examples/). – Amadan Feb 28 '20 at 03:43
  • I'm not looking for a perfect implementation of capturing names and emails. I'm aware that there can be a few false positives, especially for names. – dchun Feb 28 '20 at 04:26
  • Rather than try to do it all in one pattern, I'd write three separate ones, if there are three separate types of possible input lines, and then use `|` between them to allow the engine to look at all three. _BUT_, proper names and the possible variations of an email address are going to regularly throw a wrench into your grabbing addresses. Human Interface people will tell you to NOT do this as it's insulting, seriously... no ABSOLUTELY insulting to your possible clients (or "targets") when you completely blow their names. – the Tin Man Feb 28 '20 at 05:41

2 Answers

Your text is already almost too random to make this work. On top of that, names and emails can be very difficult to capture. A more advanced email pattern would only help a little. There are not only unusual email addresses but also all sorts of wild name patterns.
What about D'arcy Bly, Markus-Anthony Reid, or Lee Z? And those are probably the simplest examples.

So you have to make a lot of assumptions, and you won't be fully satisfied unless you use more advanced techniques such as natural language processing.

If you insist on your approach, I came up with this (toothless) monstrosity:

([A-Z]\w+ [A-Z]\w+)(?:\w* )*([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})|
([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})(?:\w* )*([A-Z]\w+ [A-Z]\w+)|
([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})

The order of the alternation groups is important to be able to capture the stray email.

Demo

PS: The demo uses a branch reset group so that everything is captured in groups 1 and 2 only. However, it looks like Ruby 2.x does not support branch reset groups, so there you need to check all 5 groups for values.
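
Without branch reset, each match comes back as five captures, most of them nil. Here is a minimal sketch of collapsing them into `[name, email]` pairs in Ruby, assuming the sub-patterns from the question. The constant names `NAME`, `EMAIL`, `CONTACT` and the helper `extract_contacts` are made up for illustration, and the pattern is written in extended mode purely for readability:

```ruby
# Sub-patterns copied from the pattern above (hypothetical constant names).
NAME  = /[A-Z]\w+ [A-Z]\w+/
EMAIL = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}/

# Same three alternatives as above. In extended mode a literal space
# has to be written as [ ]; the interpolated sub-patterns keep their own flags.
CONTACT = /
    (#{NAME}) (?:\w*[ ])* (#{EMAIL})   # groups 1, 2: name, then email
  | (#{EMAIL}) (?:\w*[ ])* (#{NAME})   # groups 3, 4: email, then name
  | (#{EMAIL})                         # group 5: stray email
/x

# Collapse the five groups into [name, email]; a missing name stays nil.
def extract_contacts(text)
  text.scan(CONTACT).map do |name1, email1, email2, name2, email3|
    [name1 || name2, email1 || email2 || email3]
  end
end

text = <<~TEXT
  John Doe john@doe.com random text
  Jane Doe random text jane@doe.com
  jim@doe.com  more random text tim@doe.com Tim Doe
TEXT

p extract_contacts(text)
# => [["John Doe", "john@doe.com"], ["Jane Doe", "jane@doe.com"], [nil, "jim@doe.com"], ["Tim Doe", "tim@doe.com"]]
```

The order of the `||` chains mirrors the group numbering, so the helper has to be adjusted if the alternatives are reordered.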

wp78de
  • 2
    Ruby does not use PCRE, but Onigmo. If you're using online regexp testers, you should use [Rubular](https://rubular.com/) when dealing with Ruby, not regex101. – Amadan Feb 28 '20 at 03:26
  • 1
    @Amadan Thanks, makes sense. I often fear those other regex tester sites keep the samples only for a short while. – wp78de Feb 28 '20 at 04:14
  • I agree that the possibilities could be too random to find names. Simple email addresses are easy to pick out but that's the extent of it. And, while the email pattern will work with simple ones, toss in a smattering of UUCP and old mainframe addresses and it'll blow chunks. – the Tin Man Feb 28 '20 at 05:19

Here's a rewrite of @wp78de's idea into Ruby regexp syntax:

regexp = /
    (?<name>
      [A-Z][\w]+\s+[A-Z][\w]+
    ){0}
    (?<email>
      \b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b
    ){0}

    (?:
      \g<name> (?:\w*\s)* \g<email>
    | \g<email> (?:\w*\s)* \g<name>
    | \g<email>
    )
/x

text = <<-TEXT
John Doe john@doe.com random text
Jane Doe random text jane@doe.com
jim@doe.com  more random text tim@doe.com Tim Doe
TEXT

p text.scan(regexp)
# => [["John Doe", "john@doe.com"],
# =>  ["Jane Doe", "jane@doe.com"],
# =>  [nil, "jim@doe.com"],
# =>  ["Tim Doe", "tim@doe.com"]]
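
The `(?<name>...){0}` lines only define a sub-pattern; the `\g<name>` calls are what actually match and capture. A small sketch of that define-then-call idiom in isolation, using made-up group names (`word`, `num`) and made-up input strings rather than anything from the question:

```ruby
# "word" is defined once, matched zero times at the definition site,
# and then invoked twice via \g<word>.
pair = /
  (?<word> [a-z]+ ){0}   # definition only, never matches here
  \g<word> = \g<word>    # two calls to the same sub-pattern
/x

m = pair.match("key=value")
p m[0]       # => "key=value"
p m[:word]   # => "value"  (the later call overwrites the earlier capture)

# In an alternation only one branch runs, so each name is set at most once
# and a group that was never called stays nil.
either = /
  (?<word> [a-z]+ ){0}
  (?<num>  \d+    ){0}
  (?: \g<word> : \g<num> | \g<num> )
/x

p either.match("age:42").named_captures   # => {"word"=>"age", "num"=>"42"}
p either.match("42").named_captures       # => {"word"=>nil, "num"=>"42"}
```

This is why `scan` above returns exactly two columns: there are only two capture groups, `name` and `email`, no matter how many alternatives call them.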
Amadan
  • Interesting. It looks like Perl and Ruby allow groups with the same name or why are the different captures grouped together? – wp78de Feb 28 '20 at 04:33
  • 1
    I believe `[\w]` is the same as `\w`. I like `{0}`. Just read about that recently. – Cary Swoveland Feb 28 '20 at 04:45
  • @CarySwoveland: It is. I copied OP's definitions of "name" and "email". – Amadan Feb 28 '20 at 04:55
  • `[\w]` is the same as `\w` and there's no reason to use the first. But either way `\w` is NOT what should be used when dealing with names because, as was said in The Princess Bride, "I don't think it means what you think it does." `\w` means [`[a-zA-Z0-9_]`](https://ruby-doc.org/core-2.7.0/Regexp.html#class-Regexp-label-Character+Classes) which is NOT a word containing only ASCII letters, which should be, at the minimum `[a-zA-Z]`. Not paying attention to the details is the root of all bugs. `\w` is the standard definition of a variable name in most programming languages. – the Tin Man Feb 28 '20 at 05:05
  • 1
    @theTinMan I fully agree with all of your comments, both here and under the question. As I said above, my answer is mostly about how to translate `(|...)` construct from wp78de's answer into Onigmo, and I did not modify OP's sub-expressions. – Amadan Feb 28 '20 at 05:12
  • I have no complaint about your answer, just agree that `[\w]` is a bit superfluous. You're one of the ones... when @Amadan speaks I listen. :-) – the Tin Man Feb 28 '20 at 05:17
  • 1
    @wp78de `(?<name>...)` is a named capture pattern. However, I let it match zero times at the start of the pattern, so it doesn't actually do anything. Onigmo can run subpatterns using the `\g<name>` construct (even recursively), so the actual capture happens there. Onigmo will overwrite the capture group if captured more than once (e.g. `(?<ch>\w)\g<ch>` matching `"ab"` would produce `{ "ch" => "b" }`), but it is not captured more than once here (the first mention is repeated 0 times, the other mentions are in exclusive alternation). – Amadan Feb 28 '20 at 05:29
  • 1
    @theTinMan Lol.... Listen for errors? :D (I do lots of those) I feel the same with your answers, always informative! – Amadan Feb 28 '20 at 05:30