2

I am trying to extract just the emails from a text column in OpenRefine. Some cells contain just the email, but others contain the name and email in john doe <john@doe.com> format. I have been using the following GREL/regex, but it does not return the entire email address. For the example above, I'm getting ["n@doe.com"].

value.match(
/.*([a-zA-Z0-9_\-\+]+@[\._a-zA-Z0-9-]+).*/
)

Any help is much appreciated.

Abi Hassen
  • @WiktorStribiżew I recommend you convert your comment into an answer – neontapir Feb 02 '18 at 22:37
  • `import re \n return re.findall(r"([a-zA-Z0-9_\-\+]+@[\._a-zA-Z0-9-]+)", value)[0]` is what I ended up using successfully. Thanks to @Wiktor Stribiżew and @Ettore Rizza – Abi Hassen Feb 03 '18 at 19:25

2 Answers

0

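The n is captured because of the .* before the capturing group. It greedily matches any 0+ chars other than line break chars, so it first grabs the whole line and then backtracks only as far as needed for the rest of the pattern to match. Since the group requires just one char before @, the only char that can land in Group 1 during backtracking is the one right before @.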
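To see the backtracking effect outside OpenRefine, here is a minimal sketch in plain Python. The helper name first_group and the lazy variant of the pattern are only illustrations added here, not something from the question; the greedy pattern is the one from the question, unchanged.

```python
import re

# The pattern from the question, unchanged: greedy .* in front of the group.
GREEDY = re.compile(r".*([a-zA-Z0-9_\-\+]+@[\._a-zA-Z0-9-]+).*")
# Illustrative variant: a lazy .*? lets the group start as early as possible.
LAZY = re.compile(r".*?([a-zA-Z0-9_\-\+]+@[\._a-zA-Z0-9-]+).*")


def first_group(pattern, text):
    """Return Group 1 of a match anchored at the start of text, or None."""
    m = pattern.match(text)
    return m.group(1) if m else None


sample = "john doe <john@doe.com>"
print(first_group(GREEDY, sample))  # n@doe.com
print(first_group(LAZY, sample))    # john@doe.com
```

The greedy version gives back only one char to the group, which is exactly the n@doe.com result from the question.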

If partial matches are acceptable, get rid of the .* and use

/[^<\s]+@[^\s>]+/

See the regex demo

Details

  • [^<\s]+ - 1 or more chars other than < and whitespace
  • @ - a @ char
  • [^\s>]+ - 1 or more chars other than whitespace and >.

Python/Jython implementation:

import re
res = ''
m = re.search(r'[^<\s]+@[^\s>]+', value)
if m:
    res = m.group(0)
return res
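The snippet above relies on the value variable and the bare return that OpenRefine's Python/Jython editor provides. As a rough sketch for checking the pattern outside OpenRefine, the same logic can be wrapped in an ordinary function (extract_email is a hypothetical name, and the sample cells are made up for illustration):

```python
import re

# Partial-match pattern from the answer above.
EMAIL = re.compile(r"[^<\s]+@[^\s>]+")


def extract_email(value):
    """Return the first email-like token in value, or '' if there is none."""
    m = EMAIL.search(value)
    return m.group(0) if m else ""


# Both cell shapes from the question, plus a cell without any address.
for cell in ["john@doe.com", "john doe <john@doe.com>", "no address here"]:
    print(repr(extract_email(cell)))
```

With re.search, both the bare-email cell and the name-plus-email cell should yield john@doe.com, and the last cell yields an empty string.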

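There are other ways to match these strings. In case you need a full string match, use .*<([^<]+@[^>]+)>.* and take Group 1. Here the leading .* cannot eat into the address, since it has to stop before the obligatory <. Note that this pattern only matches cells where the email is wrapped in angle brackets.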
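A small sketch of the full string variant in plain Python, assuming the address is always wrapped in < and > (the function name bracketed_email is hypothetical):

```python
import re

# Full-string pattern: the address must sit between an obligatory < and >.
BRACKETED = re.compile(r".*<([^<]+@[^>]+)>.*")


def bracketed_email(value):
    """Return the address inside <...>, or None if the cell has no brackets."""
    m = BRACKETED.match(value)
    return m.group(1) if m else None


print(bracketed_email("john doe <john@doe.com>"))  # john@doe.com
print(bracketed_email("john@doe.com"))             # None
```

The second call shows the trade-off: a cell holding only the bare email does not match, so for a mixed column the partial match is the safer choice.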

Wiktor Stribiżew
  • Thanks. Unfortunately, this only captures the emails from cells that contain just the email address; it doesn't return anything from the cells that have the name or other text before the email. I think this has to do with the regex implementation in GREL, because both expressions work fine in JavaScript – Abi Hassen Feb 02 '18 at 23:24
0

If some cells contain just the email, it's probably better to use Wiktor Stribiżew's partial match. In the development version of OpenRefine, there is now a value.find() function that can do this, but it will only be officially available in the next version (2.9). In the meantime, you can reproduce it using Python/Jython instead of GREL:

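import re
# findall returns a list of every match; guard against cells with no email
matches = re.findall(r"[^<\s]+@[^\s>]+", value)
return matches[0] if matches else None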


Ettore Rizza