split column based on 2 columns

Question

I have a table in the folowing format:

      t_id      date1      date2 date3 email
 100678318 2016-09-05       <NA>  <NA>  natas@gmail.com
 100678319       <NA> 2016-10-05  <NA>  natas@gmail.com
 100732587 2016-11-01       <NA>  <NA>  1000988@boerman.nl
 100689822 2016-09-18       <NA>  <NA>  10@line.nl
 100640340 2016-08-01       <NA>  <NA>  1111tk68@net.nl
 100641415       <NA> 2016-08-02  <NA>  1111tk68@net.nl

Now i want to change the data to a different format. (wide) To eleborate a bit more emails should be grouped to 1 row. And if we have a t_id which occurs more then once we want them to end up like t_id_1 date1 t_id_2 date2 and so forth.

So that the table looks like this (only first record for the example):

email             t_id_1      date1        t_id_2      date2      t_id_3 date3
natas@gmail.com   100678318   2016-09-05   100678319   2016-10-05   NA   NA

So probably i need some conditional formatting or something. I was hoping for a solution with Dpylr and plyr.

Try from other questions:

library(data.table)
tst <- setDT(tstDF)[, lapply(.SD, function(x) toString(na.omit(x))), by = t_id]

Hopefully there is someone with an solution for this problem.

You want a wide format, not long format. maybe [this post](http://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format) will be helpful. — lmo, Nov 16 '16 at 14:08

score 2 · Answer 1 · answered Nov 16 '16 at 14:22

I would coalesce into one date variable, then create a counter for each email and then use reshape. This assumes data is sorted by email.

library(reshape2)

coalesce <- function(...) {
  apply(cbind(...), 1, function(x) x[which(!is.na(x))[1]])
}

df$date <- as.Date(coalesce(df$date1, df$date2, df$date3), origin = '1970-01-01')
df$id <- 1
for (i in 2:nrow(df)) {
  if (df$email[i] == df$email[i - 1]) {
    df$id[i] <- df$id[i] + 1
  }
}

reshape(df[ c('id', 'date', 't_id', 'email')], idvar = 'email', timevar = 'id', direction = 'wide')

Nice solution indeed, but this only gives me the first 2 t_id's and dates. In my full DF i have around 8 transactions on average. Could you maybe eleborate the for loop a bit more so that i might change the code to my requirements? — Sander Van der Zeeuw, Nov 16 '16 at 15:25

G. Grothendieck · Accepted Answer · 2016-11-16T15:53:10.717

The details of t_id and date are not described in the question so in (1) we have assumed that there are up to 3 t_id values per email and that they appear in the order corresponding to date1, date2 and date3 respectively with all other date values being NA. For example, if there are 2 t_id values for a particular email then the first one will have date1 as a date and date2 and date3 be NA. The second will have date2 as a date and date1 and date3 will be NA. In (2) we assume the same except we generalize to k instead of 3.

No packages are used.

1) Use by to split on email and then manually construct the row for each. Finally rbind the rows together.

do.call("rbind", 
  by(DF, DF$email, function(x) {
    t_id <- c(x$t_id, NA, NA, NA)[1:3]
    date <- c(na.omit(c(x$date1, x$date2, x$date3)), NA, NA, NA)[1:3]
    data.frame(email = x$email[1], 
               t_id1 = t_id[1], date1 = date[1],
               t_id2 = t_id[2], date2 = date[2],
               t_id3 = t_id[3], date3 = date[3]
    )
  }
))

giving:

                                email     t_id1      date1     t_id2      date2
10@line.nl                 10@line.nl 100689822 2016-09-18        NA       <NA>
1000988@boerman.nl 1000988@boerman.nl 100732587 2016-11-01        NA       <NA>
1111tk68@net.nl       1111tk68@net.nl 100640340 2016-08-01 100641415 2016-08-02
natas@gmail.com       natas@gmail.com 100678318 2016-09-05 100678319 2016-10-05
                   t_id3 date3
10@line.nl            NA  <NA>
1000988@boerman.nl    NA  <NA>
1111tk68@net.nl       NA  <NA>
natas@gmail.com       NA  <NA>

2) If desired we could generalize this to up to k dates and t_id values. In this case rbind/by produces a new data frame long which has k rows for each email. The first row for each email in long corresponds to the first tid and date and so on up to the kth. long is subsequently reshaped to wide.

is.date <- grepl("date", names(DF))
k <- sum(is.date)

long <- do.call("rbind", 
  by(DF, DF$email, function(x)
    data.frame(email = x$email[1], 
       time = 1:k,
       t_id = c(x$t_id,  rep(NA, k))[1:k],
       date = c(na.omit(do.call("c", x[is.date])), rep(NA, k))[1:k]
    )
  )
)
reshape(long, dir = "wide", idvar = "email")

giving:

                                    email    t_id.1     date.1    t_id.2     date.2 t_id.3 date.3
10@line.nl.1                   10@line.nl 100689822 2016-09-18        NA       <NA>     NA   <NA>
1000988@boerman.nl.1   1000988@boerman.nl 100732587 2016-11-01        NA       <NA>     NA   <NA>
1111tk68@net.nl.date11    1111tk68@net.nl 100640340 2016-08-01 100641415 2016-08-02     NA   <NA>
natas@gmail.com.date11    natas@gmail.com 100678318 2016-09-05 100678319 2016-10-05     NA   <NA>

Note: The input DF in reproducible form was assumed to be:

Lines <- "t_id      date1      date2 date3 email
 100678318 2016-09-05       <NA>  <NA>  natas@gmail.com
 100678319       <NA> 2016-10-05  <NA>  natas@gmail.com
 100732587 2016-11-01       <NA>  <NA>  1000988@boerman.nl
 100689822 2016-09-18       <NA>  <NA>  10@line.nl
 100640340 2016-08-01       <NA>  <NA>  1111tk68@net.nl
 100641415       <NA> 2016-08-02  <NA>  1111tk68@net.nl"

DF <- transform(read.table(text = Lines, header = TRUE, na.strings = "<NA>"),
          date1 = as.Date(date1),
          date2 = as.Date(date2),
          date3 = as.Date(date3))

Thanks for the very extensive elaboration. I will start trying! Maybe i will come back for some extra elaboration if you want to help me. — Sander Van der Zeeuw, Nov 16 '16 at 15:47

split column based on 2 columns

2 Answers2