spread() where non-unique values are put into a new column

Question

I have some data that looks like this (code for input at the end):

#>           artist          album year  source                     id
#> 1        Beatles  Sgt. Pepper's 1967  amazon             B0025KVLTM
#> 2        Beatles  Sgt. Pepper's 1967 spotify 6QaVfG1pHYl1z15ZxkvVDW
#> 3        Beatles  Sgt. Pepper's 1967  amazon             B06WGVMLJY
#> 4 Rolling Stones Sticky Fingers 1971 spotify 29m6DinzdaD0OPqWKGyMdz

I would like to fix the 'id' column (which includes ids from multiple sources, as shown in the 'source' column.

This should be a straightfoward spread(), but the complication is that sometimes we have a duplicate id from the exact same source: see row 1 and row 3 above.

Is there an easy way to do the spread() and put the duplicate id in a new column?

My desired outcome would be:


#>           artist          album year  source  amazon_id amazon_id_2
#> 1        Beatles  Sgt. Pepper's 1967  amazon B0025KVLTM  B06WGVMLJY
#> 2 Rolling Stones Sticky Fingers 1971 spotify       <NA>        <NA>
#>                  spotify
#> 1 6QaVfG1pHYl1z15ZxkvVDW
#> 2 29m6DinzdaD0OPqWKGyMdz

The code below is to input the sample data:

df <- data.frame(stringsAsFactors=FALSE,
      artist = c("Beatles", "Beatles", "Beatles", "Rolling Stones"),
       album = c("Sgt. Pepper's", "Sgt. Pepper's", "Sgt. Pepper's",
                 "Sticky Fingers"),
        year = c(1967, 1967, 1967, 1971),
      source = c("amazon", "spotify", "amazon", "spotify"),
          id = c("B0025KVLTM", "6QaVfG1pHYl1z15ZxkvVDW", "B06WGVMLJY",
                 "29m6DinzdaD0OPqWKGyMdz")
)
df

score 3 · Answer 1 · answered Jun 21 '19 at 08:35

This can be done with dcast from data.table in one (looong) line. But therefore quite elegantly I think.

library(data.table)
dcast(df, artist + album + year ~ paste(source, rowid(artist, source), sep = "_"))
#          artist          album year   amazon_1   amazon_2              spotify_1
#1        Beatles  Sgt. Pepper's 1967 B0025KVLTM B06WGVMLJY 6QaVfG1pHYl1z15ZxkvVDW
#2 Rolling Stones Sticky Fingers 1971       <NA>       <NA> 29m6DinzdaD0OPqWKGyMdz

tmfmnk · Answer 2 · 2019-06-21T08:38:07.483

One possibility could be:

df %>%
 group_by(artist, album, year, source) %>%
 mutate(source2 = paste(source, row_number(), sep = "_")) %>%
 spread(source2, id) %>%
 ungroup()

  artist         album           year source  amazon_1   amazon_2   spotify_1             
  <chr>          <chr>          <dbl> <chr>   <chr>      <chr>      <chr>                 
1 Beatles        Sgt. Pepper's   1967 amazon  B0025KVLTM B06WGVMLJY <NA>                  
2 Beatles        Sgt. Pepper's   1967 spotify <NA>       <NA>       6QaVfG1pHYl1z15ZxkvVDW
3 Rolling Stones Sticky Fingers  1971 spotify <NA>       <NA>       29m6DinzdaD0OPqWKGyMdz

Notice that here the output consists of three rows as spotify is a unique "source" for the Beatles album.

If you, nevertheless, want two rows, you can do:

df %>%
 group_by(artist, album, year, source) %>%
 mutate(source2 = paste(source, row_number(), sep = "_")) %>%
 ungroup() %>%
 select(-source) %>%
 spread(source2, id) 

  artist         album           year amazon_1   amazon_2   spotify_1             
  <chr>          <chr>          <dbl> <chr>      <chr>      <chr>                 
1 Beatles        Sgt. Pepper's   1967 B0025KVLTM B06WGVMLJY 6QaVfG1pHYl1z15ZxkvVDW
2 Rolling Stones Sticky Fingers  1971 <NA>       <NA>       29m6DinzdaD0OPqWKGyMdz

And if you want to have also the "source" column:

df %>%
 group_by(artist, album, year, source) %>%
 mutate(source2 = paste(source, row_number(), sep = "_")) %>%
 group_by(artist, album, year) %>%
 mutate(source = toString(unique(source))) %>%
 spread(source2, id) %>%
 ungroup()

  artist         album           year source          amazon_1  amazon_2  spotify_1            
  <chr>          <chr>          <dbl> <chr>           <chr>     <chr>     <chr>                
1 Beatles        Sgt. Pepper's   1967 amazon, spotify B0025KVL… B06WGVML… 6QaVfG1pHYl1z15ZxkvV…
2 Rolling Stones Sticky Fingers  1971 spotify         <NA>      <NA>      29m6DinzdaD0OPqWKGyM…

I'm trying to work out how the code works for your answer with two rows (i.e. your second answer). I wonder if you can help me understand how it stops three lines from being the result? I tried breaking the code up, so I could see each step (code above in postscript to question) but somehow breaking it up means that three lines get returned. — Jeremy K., Jun 22 '19 at 08:37
I'm not having luck posting code as comments, but this is what I meant: df2 <- df %>% group_by(artist, album, year, source) %>% mutate(source2 = paste(source, row_number(), sep = "_")) %>% ungroup() %>% select(-source) df2 # spread(source2, id) df2 <- df2 %>% spread(source2, id) — Jeremy K., Jun 22 '19 at 08:43
It does it by removing the "source" column in the step `select(-source)`. — tmfmnk, Jun 22 '19 at 09:09

score 2 · Answer 3 · answered Jun 21 '19 at 10:34

Also possible in base R with ave and reshape.

df$source <- with(df, paste(source, 
                            ave(artist, source, FUN=function(i) 
                              cumsum(duplicated(i)) + 1)), sep="_")
reshape(df, timevar="source", idvar=c("artist", "album", "year"), direction="wide")
#           artist          album year id.amazon_1           id.spotify_1 id.amazon_2 id.amazon_3
# 1        Beatles  Sgt. Pepper's 1967  B0025KVLTM 6QaVfG1pHYl1z15ZxkvVDW  B06WGVMLJY     SoMeFoO
# 4 Rolling Stones Sticky Fingers 1971        <NA> 29m6DinzdaD0OPqWKGyMdz        <NA>        <NA>

Data

df <- structure(list(artist = c("Beatles", "Beatles", "Beatles", "Rolling Stones"
), album = c("Sgt. Pepper's", "Sgt. Pepper's", "Sgt. Pepper's", 
"Sticky Fingers"), year = c(1967, 1967, 1967, 1971), source = c("amazon", 
"spotify", "amazon", "spotify"), id = c("B0025KVLTM", "6QaVfG1pHYl1z15ZxkvVDW", 
"B06WGVMLJY", "29m6DinzdaD0OPqWKGyMdz")), class = "data.frame", row.names = c(NA, 
-4L))
df <- rbind(df, df[1, ])
df[5, 5] <- "SoMeFoO"

score 1 · Accepted Answer · answered Jun 21 '19 at 09:46

1

Here is one approach.

df %>% 
  group_by(artist,source) %>%  
  mutate(rownum = row_number()) %>% 
  unite(source, source, rownum, sep="_") %>% 
  spread(source,id)

# A tibble: 2 x 6
# Groups:   artist [2]
  artist         album           year amazon_1   amazon_2   spotify_1             
  <chr>          <chr>          <dbl> <chr>      <chr>      <chr>                 
1 Beatles        Sgt. Pepper's   1967 B0025KVLTM B06WGVMLJY 6QaVfG1pHYl1z15ZxkvVDW
2 Rolling Stones Sticky Fingers  1971 NA         NA         29m6DinzdaD0OPqWKGyMdz

answered Jun 21 '19 at 09:46

Theo

575
3
8

Can you explain why the final line `spread(source, id) doesn't create two lines for the Beatles one, even though there are two sources for Beatles, i.e. "amazon_1" and "amazon_2"? – Jeremy K. Jun 22 '19 at 02:17
Also, in this line: `unite(source, source, rownum, sep="_")`, I'm trying to work out why we put "source" twice. – Jeremy K. Jun 22 '19 at 02:24

spread() where non-unique values are put into a new column

4 Answers4