Reshape from long to wide format

Question

example

customer_code    items
1                sugar
1                salt       
2                sugar      
2                accessories
3                salt

Desired output

customer_code   item   item2       item3
1              sugar   salt     
2              sugar             accessories
3                      salt

The `NA` value in expected output for `des_subcat1` is not making sense as the customer id `3.12E+12` have articulo in `des_subcat` column — akrun, May 26 '15 at 04:50

David Arenburg · Answer 1 · 2015-05-22T11:31:45.587

5

You can do a simple dcast here

library(reshape2)
dcast(df, customer_code ~ paste("items", items, sep = "_"), value.var = "items")
#   customer_code items_accessories items_salt items_sugar
# 1             1              <NA>       salt       sugar
# 2             2       accessories       <NA>       sugar
# 3             3              <NA>       salt        <NA>

Or a bit closer to your desired output

library(data.table)
setDT(df)[, indx := paste0("items", .GRP), by = items]
dcast(df, customer_code ~ indx, value.var = "items")
#    customer_code items1 items2      items3
# 1:             1  sugar   salt          NA
# 2:             2  sugar     NA accessories
# 3:             3     NA   salt          NA

edited May 22 '15 at 11:31

answered May 22 '15 at 11:01

David Arenburg

91,361
17
137
196

I like the way you created indx with `.GRP` – akrun May 22 '15 at 13:50
it returns numeric value and identifies only the first item – Saugandh Datta May 26 '15 at 04:05
So you need to provide s reproducible exsmple because it works on your data – David Arenburg May 26 '15 at 04:13

score 3 · Accepted Answer · answered May 22 '15 at 10:12

3

You could use spread from tidyr

library(dplyr)
library(tidyr)
  mutate(df1, var=factor(items, levels=unique(items), 
      labels=paste0('items', seq(n_distinct(items))))) %>% 
            spread(var, items, fill='')
#  customer_code items1 items2      items3
#1             1  sugar   salt            
#2             2  sugar        accessories
#3             3          salt

answered May 22 '15 at 10:12

akrun

874,273
37
540
662

Error: Duplicate identifiers for rows – Saugandh Datta May 26 '15 at 04:02
@SaugandthData Based on the example provided, I didn't get any errors. You may need to create a sequence variable for your original dataset as there are duplicates. You can update the post with a small example that gives the error – akrun May 26 '15 at 04:06
check %>% group_by(customer_code) %>% spread(des_subcat, des_subcat) -> df_wide Error: Duplicate identifiers for rows (353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363), (111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121), (683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693), (34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726), (518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528), (474, 475, – Saugandh Datta May 26 '15 at 04:17
@SaugandhDatta I already said the reason for the error. If you can update your post with a small example that reproduce the error, I can test it. – akrun May 26 '15 at 04:21
article_code customer_code des_subsettore des_subcat 9014 8.92E+12 GROCERY CONDIMENTI PRIMI (PELATI & SUGHI) 9014 8.92E+12 GROCERY CONDIMENTI PRIMI (PELATI & SUGHI) 9014 8.92E+12 GROCERY articlo 9014 3.25E+12 GROCERY articlo 9014 3.25E+12 GROCERY CONDIMENTI PRIMI (PELATI & SUGHI) 9014 3.25E+12 GROCERY CONDIMENTI PRIMI (PELATI & SUGHI) – Saugandh Datta May 26 '15 at 04:26
@SaugandhDatta Please update it in your original post. It is not easy to copy/paste from the comments as the format is not good – akrun May 26 '15 at 04:28
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/78731/discussion-between-saugandh-datta-and-akrun). – Saugandh Datta May 26 '15 at 04:40

Cath · Answer 3 · 2015-05-22T11:07:28.643

you can try using function reshape:

to get as many columns as there are different items possible:

new_df <- reshape(df, idvar="customer_code", timevar="items", v.names="items", direction="wide")
new_df
#  customer_code items.sugar items.salt items.accessories
#1             1       sugar       salt              <NA>
#3             2       sugar       <NA>       accessories
#5             3        <NA>       salt              <NA>

you can change the column names afterwards with colnames(new_df)[-1] <- paste0("item", 1:(ncol(new_df)-1))

another option, in case you want to get as many column as the max number of items a unique customer can have:

df_split <- split(df, df[, 1])
df_split <- lapply(df_split, reshape, idvar="customer_code", timevar="items", v.names="items", direction="wide")
max_item <- max(sapply(df_split, ncol))
df_split <- lapply(df_split, function(df){ 
                                 if(ncol(df) < max_item) df <- cbind(df, matrix(NA, ncol=max_item - ncol(df)))
                                 colnames(df)[-1] <- paste0("item", 1:(max_item-1))
                                 return(df)
                              })
new_df <- do.call("rbind", df_split)
new_df
#  customer_code item1       item2
#1             1 sugar        salt
#2             2 sugar accessories
#3             3  salt        <NA>

inscaven · Answer 4 · 2015-05-22T11:59:17.693

Packages dplyr and especially tidyr can solve this kind of problems. This code does the trick.

require("tidyr")
require("dplyr")
df %>% group_by(customer_code) %>% spread(items, items) -> df_wide
#   customer_code accessories salt sugar
# 1             1          NA salt sugar
# 2             2 accessories   NA sugar
# 3             3          NA salt    NA

Hope there is no problem to change colnames if necessary:

names(df_wide)[-1] <- paste0("item", 1:(ncol(df_wide)-1))
#   customer_code       item1 item2 item3
# 1             1          NA  salt sugar
# 2             2 accessories    NA sugar
# 3             3          NA  salt    NA

In addition may suggest this form of output (may be convenient):

df  %>% mutate(present = T) %>% spread(items, present, fill = F)
#   customer_code accessories  salt sugar
# 1             1       FALSE  TRUE  TRUE
# 2             2        TRUE FALSE  TRUE
# 3             3       FALSE  TRUE FALSE

Error: Duplicate identifiers for rows – Saugandh Datta May 26 '15 at 04:04 — Saugandh Datta, May 26 '15 at 04:04

Reshape from long to wide format

4 Answers4