Using regex to extract email address after @ in dplyr pipe and then groupby to count occurrences

Question

I have dataframe which has column called email. I want to find email addresses after @ symbol and then group by e.g (gmail,yahoo,hotmail) and count the occurrences of the same.

 registrant_email 
 chamukan@yahoo.com     
 tmrsons1974@yahoo.com     
 123ajumohan@gmail.com     
 123@websiterecovery.org     
 salesdesk@2techbrothers.com       
 salesdesk@2techbrothers.com

Now I can extract emails after @ using below code

sub(".*@", "", df$registrant_email)

How can I use it in dplyr pipe and then count occurrences of each email address

That naive email parsing will come back to haunt you some day. http://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx/ — hrbrmstr, Aug 21 '16 at 01:50

score 3 · Answer 1 · answered Aug 20 '16 at 23:13

3

tidyr::separate is useful for splitting columns:

library(tidyr)
library(dplyr)

       # separate email into `user` and `domain` columns
df %>% separate(registrant_email, into = c('user', 'domain'), sep = '@') %>% 
    # tally occurrences for each level of `domain`
    count(domain)

## # A tibble: 4 x 2
##                domain     n
##                 <chr> <int>
## 1   2techbrothers.com     2
## 2           gmail.com     1
## 3 websiterecovery.org     1
## 4           yahoo.com     2

answered Aug 20 '16 at 23:13

alistaire

42,459
4
77
117

One thing missing is you didn't do a group-wise count. Otherwise, it is the idiomatic solution. – shayaa Aug 21 '16 at 00:52
I took the question as asking how often domains are repeated, not whole emails, which just requires `df %>% count(registrant_email)`, not splitting. To split and then count emails, you just have to change it to `count(user, domain)`, but it seems like wasted effort without more context. – alistaire Aug 21 '16 at 01:05

shayaa · Accepted Answer · 2016-08-20T22:24:55.010

By first splitting into a character matrix, after coercing to data.frame, we can use common dplyr idioms

library(dplyr)
library(stringr) 

str_split_fixed(df$registrant_email, pattern = "@", n =2) %>%
  data.frame %>% group_by(X2) %>% count(X1)

The result is as follows

                   X2          X1     n
               <fctr>      <fctr> <int>
1   2techbrothers.com   salesdesk     2
2           gmail.com 123ajumohan     1
3 websiterecovery.org         123     1
4           yahoo.com    chamukan     1
5           yahoo.com tmrsons1974     1

If you want to set variable names for better code comprehension, you can use

str_split_fixed(df$registrant_email, pattern = "@", n =2) %>%
  data.frame %>% setNames(c("local", "domain")) %>% 
  group_by(domain) %>% count(local)

score 1 · Answer 3 · answered Aug 21 '16 at 03:22

We can use base R methods for this

aggregate(V1~V2, read.table(text = df1$registrant_email, 
          sep="@", stringsAsFactors=FALSE), FUN = length)
#                   V2 V1
#1   2techbrothers.com  2
#2           gmail.com  1
#3 websiterecovery.org  1
#4           yahoo.com  2

Or using the OP's method and wrap it with table

as.data.frame(table(sub(".*@", "", df1$registrant_email)))
#                 Var1 Freq
#1   2techbrothers.com    2
#2           gmail.com    1
#3 websiterecovery.org    1
#4           yahoo.com    2

Using regex to extract email address after @ in dplyr pipe and then groupby to count occurrences

3 Answers3