
I have data with a sender, a receiver, and the number of emails sent. A toy example:

senders <- c("Mable","Beth", "Beth","Susan","Susan")
receivers <- c("Beth", "Mable", "Susan", "Mable","Beth")
num_email <- c(1,1,2,1,1)

df <- data.frame(senders, receivers, num_email)

senders receivers num_email
Mable      Beth          1
Beth       Mable         1
Beth       Susan         2
Susan      Mable         1
Susan      Beth          1

I'd like to get a data.frame that has the total number of messages for each unique pair. E.g. the connection Mable | Beth would have the value 2, because Mable sent Beth one message and Beth sent Mable one message. The resulting data.frame should have only one row for each unique combination of emailers (e.g. there would be only Mable | Beth or Beth | Mable, not both).

I've tried various approaches with reshape and data.table, but I'm not having any luck. I'd like to avoid creating a unique string BethMable and merging that way. Many thanks

Rocinante

2 Answers


One option is base R: first sort the first two columns within each row, so that each pair always appears in the same order. We use apply with MARGIN=1 to do that, transpose the output, and combine it with the third column in a 'data.frame' to create 'df1'. The formula method of aggregate then gives the sum of 'num_email' grouped by the first two columns of the transformed dataset.

df1 <- data.frame(t(apply(df[1:2], 1, sort)), df[3])
aggregate(num_email~., df1, FUN=sum)

#      X1    X2 num_email
# 1  Beth Mable         2
# 2  Beth Susan         3
# 3 Mable Susan         1
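The row-wise sort above can also be done without apply. The following is only a sketch, assuming the name columns are character vectors (hence `stringsAsFactors = FALSE`); the column names `person_A` and `person_B` are illustrative, borrowed from the suggestion in the comments.

```r
# Rebuild the toy data so the example is self-contained
senders   <- c("Mable", "Beth", "Beth", "Susan", "Susan")
receivers <- c("Beth", "Mable", "Susan", "Mable", "Beth")
num_email <- c(1, 1, 2, 1, 1)
df <- data.frame(senders, receivers, num_email, stringsAsFactors = FALSE)

# pmin/pmax work element-wise on character vectors, so each pair
# ends up in alphabetical order regardless of who was the sender
pairs <- data.frame(
  person_A  = pmin(df$senders, df$receivers),
  person_B  = pmax(df$senders, df$receivers),
  num_email = df$num_email
)

# Sum the emails for each ordered pair
res <- aggregate(num_email ~ person_A + person_B, data = pairs, FUN = sum)
res
```

Because pmin and pmax are vectorised, this avoids one function call per row, which may matter on larger data.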

Or using data.table: we convert the first two columns to character class, use unname so that they get the default column names 'V1', 'V2', and convert to 'data.table'. Character columns compare lexicographically, so the logical index in i (V1 > V2) selects the rows where the pair is out of order. For those rows we assign (:=) the two columns in reversed order (.(V2, V1)), and then get the sum of 'num_email' grouped by 'V1', 'V2'.

library(data.table)
dt = do.call(data.table, c(lapply(unname(df[1:2]), as.character), df[3]))
dt[V1 > V2, c("V1", "V2") := .(V2, V1)]
dt[, .(num_email = sum(num_email)), by= .(V1, V2)]

#       V1    V2 num_email
# 1:  Beth Mable         2
# 2:  Beth Susan         3
# 3: Mable Susan         1
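The swap step relies on how R compares character vectors. Here is a small sketch of the same idea on plain vectors; `v1`, `v2` and `swap` are made-up names for illustration and are not part of the data.table code above.

```r
v1 <- c("Mable", "Beth", "Susan")
v2 <- c("Beth", "Susan", "Mable")

# Element-wise lexicographic comparison: TRUE where the pair is out of order
swap <- v1 > v2

# Swap only the out-of-order positions
tmp      <- v1[swap]
v1[swap] <- v2[swap]
v2[swap] <- tmp

# After the swap, every pair satisfies v1 <= v2
data.frame(v1, v2)
```

The exact ordering of character strings depends on the collation of the current locale, so names that differ only in case or accents may sort differently across systems.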

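Or using dplyr, we use mutate with across (dplyr >= 1.0.0; older versions used mutate_each with funs) to convert the columns to character class, then put each pair in a fixed order with pmin and pmax, group by 'V1', 'V2' and get the sum of 'num_email'.

library(dplyr)
df %>%
  # make sure the name columns are character, not factor
  mutate(across(c(senders, receivers), as.character)) %>%
  # V1 is the alphabetically smaller name of each pair, V2 the larger
  mutate( V1 = pmin(senders, receivers), 
          V2 = pmax(senders, receivers) ) %>%
  group_by(V1, V2) %>%
  summarise(num_email=sum(num_email))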

#      V1    V2 num_email
#   (chr) (chr)     (dbl)
# 1  Beth Mable         2
# 2  Beth Susan         3
# 3 Mable Susan         1

NOTE: The data.table solution was updated by @Frank.

akrun
  • I would love for the data table solution to work. If I enter that code I get an error - Error in eval(expr, envir, enclos) : could not find function "." – Rocinante Feb 06 '15 at 07:00
  • @Rocinante - you need to update your `data.table` version. Or replace `.` with `list` I think. – thelatemail Feb 06 '15 at 07:01
  • @Rocinante Have you tried replacing the `.` with `list`, as thelatemail suggested? – akrun Feb 06 '15 at 07:11
  • @akrun - I updated data.table and still get the same error. If I use list in place, as in the updated solution you posted, I get the original df back. Any ideas? – Rocinante Feb 06 '15 at 07:15
  • @Rocinante I am using `data.table_1.9.5`, the devel version. Which version do you have? – akrun Feb 06 '15 at 07:17
  • @akrun data.table_1.9.2. I'll upgrade to 1.9.5 and retry. Thanks! – Rocinante Feb 06 '15 at 07:19
  • Thanks @akrun, the first solution (using aggregate) works for me. Appreciate it! – Rocinante Feb 06 '15 at 18:29
  • @akrun This question has just been a dupe target for the same question, but specific to data.table. Could you update your data.table answer to be self-contained (starting from `df`)? (I'm guessing you would want to use some different approach than `t` and `apply` with a DT, but don't know what.) I would convert to char and then do `DT[a>b, c("a","b") := .(b,a)]` and `DT[, sum(x), by=.(a,b)]`. By the way, `senders` and `receivers` make no sense as col names in the result; `person_A` and `person_B` or similar might make more sense. – Frank Oct 19 '15 at 12:10
  • @Frank Can you show the link? Or can you update the post (as it may take more time to understand the problem)? Thanks. – akrun Oct 19 '15 at 12:13
  • http://stackoverflow.com/questions/31675416/summarise-data-by-removing-duplicates?lq=1 I'll have a go at updating the post right now, sure. – Frank Oct 19 '15 at 12:26
  • Let me know if you have a problem with the edit and I'll try to explain. You can overrule/reverse it, of course. – Frank Oct 19 '15 at 12:38
  • @Frank Thanks for updating. The second line is classy `dt[V1 > V2, c("V1", "V2") := .(V2, V1)]`. How does the `V1 >V2` work when they are both 'character' columns? – akrun Oct 19 '15 at 12:47
  • Thanks. Character column comparisons use lexicographic ordering (https://en.wikipedia.org/wiki/Lexicographical_order), which is pretty much the same as alphabetical, but also includes other chars. – Frank Oct 19 '15 at 12:50

Another solution:

senders <- c("Mable","Beth", "Beth","Susan","Susan")
receivers <- c("Beth", "Mable", "Susan", "Mable","Beth")
num_email <- c(1,1,2,1,1)

df <- data.frame(senders, receivers, num_email)

# finding unique users
users <- unique(c(senders, receivers))
# generate combinations without repetitions
user_combi <- gtools::combinations(v=users, n=length(users), r=2)

# count the number of mails for each combination
counts <- apply(user_combi, MARGIN=1, FUN=function(x) 
                     sum(df$num_email[ (df$senders %in% x) & (df$receivers %in% x)])
               )

# wrap up in a data.frame
df2 <- data.frame(user_combi, counts)

This gives:

> df2
     X1    X2 counts
1  Beth Mable      2
2  Beth Susan      3
3 Mable Susan      1
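If gtools is not installed, the same pairs can be generated with combn from base R's utils package. This is only a sketch of that substitution, not the approach used in the answer above, and it carries the same assumption as the counting step: nobody emails themselves, since a self-addressed row would be counted in every pair containing that person.

```r
senders   <- c("Mable", "Beth", "Beth", "Susan", "Susan")
receivers <- c("Beth", "Mable", "Susan", "Mable", "Beth")
num_email <- c(1, 1, 2, 1, 1)
df <- data.frame(senders, receivers, num_email, stringsAsFactors = FALSE)

# Sorted unique users, so pairs come out in alphabetical order
users <- sort(unique(c(df$senders, df$receivers)))

# combn returns one pair per column; transpose to get one pair per row
user_combi <- t(combn(users, 2))

# For each pair, sum the emails whose sender and receiver are both in the pair
counts <- apply(user_combi, MARGIN = 1, FUN = function(x)
  sum(df$num_email[(df$senders %in% x) & (df$receivers %in% x)]))

df2 <- data.frame(user_combi, counts)
df2
```

Note that this enumerates every possible pair, so pairs that never exchanged email appear with a count of 0, and the number of pairs grows quadratically with the number of users.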
asachet