0

Trying to create a new data_frame based on the order date column and emails. So if I have a duplicated email (e.g. cheers@web.com in the example below), I want to merge the emails and place the order_date variable in a new column next to it. I want to do this in the full DF. This will introduce many NAs but i will solve that problem later.

I have a dataframe as follows:

Source: local data frame [6 x 4]
Groups: email [5]

                email order_date `sum(price_excl_vat_euro)` `sum(total_qty)`
                <chr>     <date>                      <dbl>            <int>
1 whatis@web.com 2016-09-05                     140.48                2
2 myemail@web.com 2016-11-01                     41.31                1
3 whereto@web.com 2016-09-18                     61.98                1
4 cheers@web.com 2016-08-01                      61.98                1
5 cheers@web.com 2016-08-02                      61.98                1
6 hello@web.com 2016-08-02                      140.49                1

What i want to obtain is (the other columns i do not care about for now):

email            order_date1    order_date2
whatis@web.com   2016-09-05     NA
myemail@web.com  2016-11-01     NA
whereto@web.com  2016-09-18     NA
cheers@web.com   2016-08-01     2016-08-02
hello@web.com    2016-08-02     NA

It is important to know that the number of orders could vary between 1-10 (average). I tried the spread function from the tidyr package. But couldn't get it to work. Any hints are very appreciated!

Sander Van der Zeeuw
  • 1,092
  • 1
  • 13
  • 35

1 Answers1

3

For example

df <- read.table(row.names=1, stringsAsFactors = F, text="
1 whatis@web.com 2016-09-05                     140.48                2
2 myemail@web.com 2016-11-01                     41.31                1
3 whereto@web.com 2016-09-18                     61.98                1
4 cheers@web.com 2016-08-01                      61.98                1
5 cheers@web.com 2016-08-02                      61.98                1
6 hello@web.com 2016-08-02                      140.49                1")
df <- df[order(df[,1], df[,2]), ]
lst <- split(df[,2],df[,1])
do.call(rbind, lapply(lst, "length<-", max(lengths(lst))))
#                 [,1]         [,2]        
# cheers@web.com  "2016-08-01" "2016-08-02"
# hello@web.com   "2016-08-02" NA          
# myemail@web.com "2016-11-01" NA          
# whatis@web.com  "2016-09-05" NA          
# whereto@web.com "2016-09-18" NA      

or

library(tidyverse)
df %>% 
  arrange(V2, V3) %>% 
  group_by(V2) %>% 
  transmute(V3, date=paste0("date", 1:n())) %>% 
  spread(date, V3)
# Source: local data frame [5 x 3]
# Groups: V2 [5]
# 
#                V2      date1      date2
# *           <chr>      <chr>      <chr>
# 1  cheers@web.com 2016-08-01 2016-08-02
# 2   hello@web.com 2016-08-02       <NA>
# 3 myemail@web.com 2016-11-01       <NA>
# 4  whatis@web.com 2016-09-05       <NA>
# 5 whereto@web.com 2016-09-18       <NA>
lukeA
  • 53,097
  • 5
  • 97
  • 100
  • Thanks this works indeed. Can you maybe also tell me how i can keep the other columns? Guess adding them to arrange should do the job. – Sander Van der Zeeuw Nov 16 '16 at 09:10
  • Sum them up and merge them afterwards ( for example). Or: Post a new question and specify the exact desired output (which I don't know) – lukeA Nov 16 '16 at 09:54
  • Well after splitting them i have a order lets say 2 times. So the email from that user appears as 2 rows, where order_date is filled in first row date1 and second row date2. If i have multiple orders it continues 3rd row date3. Basically i want to merge these columns into 1. based on transaction_ids – Sander Van der Zeeuw Nov 16 '16 at 10:03