1

I have dataframe which has column called email. I want to find email addresses after @ symbol and then group by e.g (gmail,yahoo,hotmail) and count the occurrences of the same.

 registrant_email 
 chamukan@yahoo.com     
 tmrsons1974@yahoo.com     
 123ajumohan@gmail.com     
 123@websiterecovery.org     
 salesdesk@2techbrothers.com       
 salesdesk@2techbrothers.com   

Now I can extract emails after @ using below code

sub(".*@", "", df$registrant_email)

How can I use it in dplyr pipe and then count occurrences of each email address

Neil
  • 7,937
  • 22
  • 87
  • 145
  • @nrussell My question is different – Neil Aug 20 '16 at 22:13
  • That naive email parsing will come back to haunt you some day. http://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx/ – hrbrmstr Aug 21 '16 at 01:50

3 Answers3

3

tidyr::separate is useful for splitting columns:

library(tidyr)
library(dplyr)

       # separate email into `user` and `domain` columns
df %>% separate(registrant_email, into = c('user', 'domain'), sep = '@') %>% 
    # tally occurrences for each level of `domain`
    count(domain)

## # A tibble: 4 x 2
##                domain     n
##                 <chr> <int>
## 1   2techbrothers.com     2
## 2           gmail.com     1
## 3 websiterecovery.org     1
## 4           yahoo.com     2
alistaire
  • 42,459
  • 4
  • 77
  • 117
  • One thing missing is you didn't do a group-wise count. Otherwise, it is the idiomatic solution. – shayaa Aug 21 '16 at 00:52
  • I took the question as asking how often domains are repeated, not whole emails, which just requires `df %>% count(registrant_email)`, not splitting. To split and then count emails, you just have to change it to `count(user, domain)`, but it seems like wasted effort without more context. – alistaire Aug 21 '16 at 01:05
1

By first splitting into a character matrix, after coercing to data.frame, we can use common dplyr idioms

library(dplyr)
library(stringr) 

str_split_fixed(df$registrant_email, pattern = "@", n =2) %>%
  data.frame %>% group_by(X2) %>% count(X1)

The result is as follows

                   X2          X1     n
               <fctr>      <fctr> <int>
1   2techbrothers.com   salesdesk     2
2           gmail.com 123ajumohan     1
3 websiterecovery.org         123     1
4           yahoo.com    chamukan     1
5           yahoo.com tmrsons1974     1

If you want to set variable names for better code comprehension, you can use

str_split_fixed(df$registrant_email, pattern = "@", n =2) %>%
  data.frame %>% setNames(c("local", "domain")) %>% 
  group_by(domain) %>% count(local)
shayaa
  • 2,787
  • 13
  • 19
1

We can use base R methods for this

aggregate(V1~V2, read.table(text = df1$registrant_email, 
          sep="@", stringsAsFactors=FALSE), FUN = length)
#                   V2 V1
#1   2techbrothers.com  2
#2           gmail.com  1
#3 websiterecovery.org  1
#4           yahoo.com  2

Or using the OP's method and wrap it with table

as.data.frame(table(sub(".*@", "", df1$registrant_email)))
#                 Var1 Freq
#1   2techbrothers.com    2
#2           gmail.com    1
#3 websiterecovery.org    1
#4           yahoo.com    2
akrun
  • 874,273
  • 37
  • 540
  • 662