3

example

customer_code    items
1                sugar
1                salt       
2                sugar      
2                accessories
3                salt

Desired output

customer_code   item   item2       item3
1              sugar   salt     
2              sugar             accessories
3                      salt
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • The `NA` value in expected output for `des_subcat1` is not making sense as the customer id `3.12E+12` have articulo in `des_subcat` column – akrun May 26 '15 at 04:50

4 Answers4

5

You can do a simple dcast here

library(reshape2)
dcast(df, customer_code ~ paste("items", items, sep = "_"), value.var = "items")
#   customer_code items_accessories items_salt items_sugar
# 1             1              <NA>       salt       sugar
# 2             2       accessories       <NA>       sugar
# 3             3              <NA>       salt        <NA>

Or a bit closer to your desired output

library(data.table)
setDT(df)[, indx := paste0("items", .GRP), by = items]
dcast(df, customer_code ~ indx, value.var = "items")
#    customer_code items1 items2      items3
# 1:             1  sugar   salt          NA
# 2:             2  sugar     NA accessories
# 3:             3     NA   salt          NA
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
3

You could use spread from tidyr

library(dplyr)
library(tidyr)
  mutate(df1, var=factor(items, levels=unique(items), 
      labels=paste0('items', seq(n_distinct(items))))) %>% 
            spread(var, items, fill='')
#  customer_code items1 items2      items3
#1             1  sugar   salt            
#2             2  sugar        accessories
#3             3          salt        
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Error: Duplicate identifiers for rows – Saugandh Datta May 26 '15 at 04:02
  • @SaugandthData Based on the example provided, I didn't get any errors. You may need to create a sequence variable for your original dataset as there are duplicates. You can update the post with a small example that gives the error – akrun May 26 '15 at 04:06
  • check %>% group_by(customer_code) %>% spread(des_subcat, des_subcat) -> df_wide Error: Duplicate identifiers for rows (353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363), (111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121), (683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693), (34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726), (518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528), (474, 475, – Saugandh Datta May 26 '15 at 04:17
  • @SaugandhDatta I already said the reason for the error. If you can update your post with a small example that reproduce the error, I can test it. – akrun May 26 '15 at 04:21
  • article_code customer_code des_subsettore des_subcat 9014 8.92E+12 GROCERY CONDIMENTI PRIMI (PELATI & SUGHI) 9014 8.92E+12 GROCERY CONDIMENTI PRIMI (PELATI & SUGHI) 9014 8.92E+12 GROCERY articlo 9014 3.25E+12 GROCERY articlo 9014 3.25E+12 GROCERY CONDIMENTI PRIMI (PELATI & SUGHI) 9014 3.25E+12 GROCERY CONDIMENTI PRIMI (PELATI & SUGHI) – Saugandh Datta May 26 '15 at 04:26
  • @SaugandhDatta Please update it in your original post. It is not easy to copy/paste from the comments as the format is not good – akrun May 26 '15 at 04:28
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/78731/discussion-between-saugandh-datta-and-akrun). – Saugandh Datta May 26 '15 at 04:40
2

you can try using function reshape:

to get as many columns as there are different items possible:

new_df <- reshape(df, idvar="customer_code", timevar="items", v.names="items", direction="wide")
new_df
#  customer_code items.sugar items.salt items.accessories
#1             1       sugar       salt              <NA>
#3             2       sugar       <NA>       accessories
#5             3        <NA>       salt              <NA>

you can change the column names afterwards with colnames(new_df)[-1] <- paste0("item", 1:(ncol(new_df)-1))

another option, in case you want to get as many column as the max number of items a unique customer can have:

df_split <- split(df, df[, 1])
df_split <- lapply(df_split, reshape, idvar="customer_code", timevar="items", v.names="items", direction="wide")
max_item <- max(sapply(df_split, ncol))
df_split <- lapply(df_split, function(df){ 
                                 if(ncol(df) < max_item) df <- cbind(df, matrix(NA, ncol=max_item - ncol(df)))
                                 colnames(df)[-1] <- paste0("item", 1:(max_item-1))
                                 return(df)
                              })
new_df <- do.call("rbind", df_split)
new_df
#  customer_code item1       item2
#1             1 sugar        salt
#2             2 sugar accessories
#3             3  salt        <NA>
Cath
  • 23,906
  • 5
  • 52
  • 86
1

Packages dplyr and especially tidyr can solve this kind of problems. This code does the trick.

require("tidyr")
require("dplyr")
df %>% group_by(customer_code) %>% spread(items, items) -> df_wide
#   customer_code accessories salt sugar
# 1             1          NA salt sugar
# 2             2 accessories   NA sugar
# 3             3          NA salt    NA

Hope there is no problem to change colnames if necessary:

names(df_wide)[-1] <- paste0("item", 1:(ncol(df_wide)-1))
#   customer_code       item1 item2 item3
# 1             1          NA  salt sugar
# 2             2 accessories    NA sugar
# 3             3          NA  salt    NA

In addition may suggest this form of output (may be convenient):

df  %>% mutate(present = T) %>% spread(items, present, fill = F)
#   customer_code accessories  salt sugar
# 1             1       FALSE  TRUE  TRUE
# 2             2        TRUE FALSE  TRUE
# 3             3       FALSE  TRUE FALSE
inscaven
  • 2,514
  • 19
  • 29