
I would like to validate (select) the entries in my database that have an appropriate email format.

SO related post here

Example of selecting the entries with the appropriate format:

example.data <- c("tint@tint.com","mailto:tint@tint.com","@tint.com",
"tint@","tint.tint.com",
"orange.com","orange@orange","orange@orange.com",
"e-mail: k-supra@k-supra.com","mailto:%20k-supra@k-supra.com")

desired.out <- c("tint@tint.com","mailto:tint@tint.com","orange@orange.com",
    "k-supra@k-supra.com","k-supra@k-supra.com")

Would someone share a working solution? Thanks.

Maximilian
  • 1
    I posted a solution, but it is based solely on the pattern shown in the example – akrun May 21 '15 at 11:29
  • 1
    `orange@orange` is syntactically a valid email address; otherwise you could not write, e.g., master@localhost. Here's a good discussion of the topic: http://www.regular-expressions.info/email.html – Peter Paul Kiefer May 21 '15 at 11:37
  • @Peter: yes indeed, thanks for that! I'm going to have a look at that. – Maximilian May 21 '15 at 11:45
  • @Maximilian There are also solutions provided there. I forgot ;-) – Peter Paul Kiefer May 21 '15 at 11:47
  • I think in the example.data you have `"e-mail: k-supra@k-supra.com"` and the desired output is different – akrun May 21 '15 at 11:47
  • @akrun: I don't see a difference. Must be a typo. – Maximilian May 21 '15 at 11:50
  • What I meant is that if you are selecting a substring of `e-mail: k-supra...`, then why did you not do the same for `"mailto:tint@tint.com"`? – akrun May 21 '15 at 11:56
  • @akrun: yes, you are right. I was thinking about selecting + cleaning :) A case like `"mailto:%20k-supra@k-supra.com"` is going to be difficult, since I'm not sure where to cut it off (20k-... or k-...). Anyway, selecting with `@` would be great for now. Thank you! – Maximilian May 21 '15 at 12:01
  • `grep('^[^@]+@[^@]+\\.[^.]+$', example.data, value=TRUE)` gets the elements in the desired output – akrun May 21 '15 at 12:03
  • I updated with a `sub` step to clean the email. Check if that helps – akrun May 21 '15 at 12:16

1 Answer

You can try:

 v1 <- grep('^[^@]+@[^@]+\\.[^.]+$', example.data, value=TRUE)
 v1
 #[1] "tint@tint.com"                 "mailto:tint@tint.com"         
 #[3] "orange@orange.com"             "e-mail: k-supra@k-supra.com"  
 #[5] "mailto:%20k-supra@k-supra.com"

To clean the strings, you could use:

 sub('^[^:]+:( |%\\d+)?', '', v1)
 #[1] "tint@tint.com"       "tint@tint.com"       "orange@orange.com"  
 #[4] "k-supra@k-supra.com" "k-supra@k-supra.com"
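The `%20` in the last element is a percent-encoded space, so an alternative to stripping `%\\d+` is to decode the string first and then drop the prefix. The following is only a sketch under the assumption that every prefix ends with a colon; `clean_email` is a hypothetical helper name, not something from the question.

```r
# Hypothetical helper: decode percent-escapes, then drop any "label:" prefix.
clean_email <- function(x) {
  # URLdecode() is applied element-wise so this also works on older R versions
  decoded <- vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE)
  # remove everything up to the first colon, plus any whitespace after it
  sub('^[^:]+:[[:space:]]*', '', decoded)
}

v1 <- c("tint@tint.com", "mailto:tint@tint.com", "orange@orange.com",
        "e-mail: k-supra@k-supra.com", "mailto:%20k-supra@k-supra.com")
clean_email(v1)
#[1] "tint@tint.com"       "tint@tint.com"       "orange@orange.com"
#[4] "k-supra@k-supra.com" "k-supra@k-supra.com"
```

Decoding first avoids having to guess whether the address starts at `20k-...` or `k-...`, because `%20` becomes an ordinary space before the prefix is removed.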


The same pattern also matches an address with a `.` before the `@`:

 grep('^[^@]+@[^@]+\\.[^.]+$', 'bill.gates@outlook.com', value=TRUE)
 #[1] "bill.gates@outlook.com"
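If the addresses live in a data frame column rather than a plain vector, `grepl` returns a logical vector that can be used to subset rows. This is a minimal sketch; the data frame `db` and its columns `id` and `email` are assumptions made for illustration, since the question does not show the actual table.

```r
example.data <- c("tint@tint.com", "mailto:tint@tint.com", "@tint.com",
                  "tint@", "tint.tint.com",
                  "orange.com", "orange@orange", "orange@orange.com",
                  "e-mail: k-supra@k-supra.com", "mailto:%20k-supra@k-supra.com")

# Hypothetical table standing in for the "db" mentioned in the question
db <- data.frame(id = seq_along(example.data),
                 email = example.data,
                 stringsAsFactors = FALSE)

# same pattern as above: something, one "@", something, a dot, a dot-free tail
pattern <- '^[^@]+@[^@]+\\.[^.]+$'

# logical vector: TRUE where the email column matches the pattern
valid <- grepl(pattern, db$email)

# keep only the rows whose email matched
db_valid <- db[valid, ]
db_valid
```

With the example data, the rows with `id` 1, 2, 8, 9 and 10 are kept, which are the same five elements that `grep(..., value=TRUE)` returns.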
akrun
  • 1
    This wouldn't capture cases like "bill.gates@outlook.com" because of the "." preceding the @. – talat May 21 '15 at 11:32
  • 1
    @Maximilian, you could change it depending on which characters will be present on either side of `@` and after `.` – akrun May 21 '15 at 11:33
  • 2
    `grep('^[^@:]+@[^@]+\\.[^.]+$', example.data, value=TRUE)` may be an alternative – akrun May 21 '15 at 11:35
  • @akrun: yes, the last suggestion is way better, capturing `"bill.gates@outlook.com"`, which is a must. Thanks. – Maximilian May 21 '15 at 11:37
  • @akrun: I have slightly edited my question, basically extending it by two more cases. Would you be so kind as to have a look? Thanks! – Maximilian May 21 '15 at 11:44