7

Folks,

I'm stumped with the following challenge. I have a data set that looks like this:

BuyerID    Fruit.1     Fruit.2    Fruit.3    Amount.1    Amount.2    Amount.3
879        Banana      Apple                 4           3
765        Strawberry  Apple      Orange     1           2           4
123        Orange      Banana                1           1           1
 11        Strawberry                        3
773        Kiwi        Banana                1           2

What I would like to do is to simplify the data (if possible) and collapse the "Fruit" and "Amount" variables

BuyerID    Fruit                             Amount      Total    Count
879        "Banana" "Apple"                  4  3            7        2
765        "Strawberry" "Apple" "Orange"     1  2  4         7        3
123        "Orange" "Banana"                 1  1  1         3        2
 11        "Strawberry"                      3               3        1
773        "Kiwi" "Banana"                   1  2            3        2

I have tried using c() and rbind() but they do not produce the results that I want - I've tried the tip here: data.frame rows to a list as well but am not too sure whether this is the best way to simplify my data.

This is so that presumably it would be easier for me to deal with fewer variables to count the occurrence of certain items for example (e.g. 60% buyers purchase Banana).

I hope this is doable - am also open to any suggestions. Any solutions appreciated!

Thank you.

Community
  • 1
  • 1
jacatra
  • 497
  • 1
  • 6
  • 8
  • You'll probably want to use the `data.table` package: `data.frame`s can only handle one value per cell. – Señor O Mar 06 '13 at 02:43
  • 2
    This looks like a pretty good candidate for a classic wide-to-long `reshape` solution. @AnandaMahto - where art thou? ;-) – thelatemail Mar 06 '13 at 02:51
  • 5
    @SeñorO -- `data.frames` can have lists too, it isn't just a data.table thing. You just have to be a bit creative when defining. eg `z <- data.frame(x = 1:5, y = I(lapply(seq_len(5),seq_len))` – mnel Mar 06 '13 at 02:58
  • And a SO reference for list columns http://stackoverflow.com/questions/9547518/creating-a-data-frame-where-a-column-is-a-list/13115651#13115651 – mnel Mar 06 '13 at 03:03
  • @SeñorO my answer demos the use of `data.frame` cells handling vectors and why this is possible (note this is a bad idea though) – Tyler Rinker Mar 06 '13 at 03:07

5 Answers5

11

Attempting to replicate your data, and using data.table

DT  <- data.frame(
  BuyerID = c(879,765,123,11,773), 
  Fruit.1 = c('Banana','Strawberry','Orange','Strawberry','Kiwi'),
  Fruit.2 = c('Apple','Apple','Banana',NA,'Banana'),
  Fruit.3 = c( NA, 'Orange',NA,NA,NA),
  Amount.1 = c(4,1,1,3,1), Amount.2 = c(3,2,1,NA,2), Amount.3 = c(NA,4,1,NA,NA),
  Total = c(7,7,3,3,3), 
  Count = c(2,3,2,1,2), 
  stringsAsFactors = FALSE)

# reshaping to long form and data.table

library(data.table)
DTlong <- data.table(reshape(DT, varying = list(Fruit = 2:4, Amount = 5:7), 
  direction = 'long'))

# create lists (without NA values)
# also adding count and total columns 
# by using <- to save Fruit and Amount for later use

DTlist <- DTlong[, list(Fruit <- list(as.vector(na.omit(Fruit.1))), 
                        Amount <- list(as.vector(na.omit(Amount.1))), 
                        Count  = length(unlist(Fruit)),
                        Total = sum(unlist(Amount))), 
                 by = BuyerID]

  BuyerID                      V1    V2 Count Total
1:     879            Banana,Apple   4,3     2     7
2:     765 Strawberry,Apple,Orange 1,2,4     3     7
3:     123           Orange,Banana 1,1,1     2     3
4:      11              Strawberry     3     1     3
5:     773             Kiwi,Banana   1,2     2     3

@RicardoSaporta edit:

You can skip the reshape step, if youd like, using list(list(c(....)))
This will probably save a fair bit of execution time (the drawback is it adds NAs not blank spaces). However, as @Marius points out, DTlong above is probably easier to work with.

DT <- data.table(DT)
DT[,   Fruit := list(list(c(  Fruit.1,   Fruit.2,   Fruit.3))), by=BuyerID]
DT[, Ammount := list(list(c(Amount.1, Amount.2, Amount.3))), by=BuyerID]

# Or as a single line
DT[,   list(  Fruit = list(c( Fruit.1,  Fruit.2,  Fruit.3)), 
            Ammount = list(c(Amount.1, Amount.2, Amount.3)), 
            Total, Count),  # other columns used
            by = BuyerID]
mnel
  • 113,303
  • 27
  • 265
  • 254
6

Here a solution , with base package. It is like Tyler solution but with a single apply.

res <- apply(DT,1,function(x){
  data.frame(Fruit= paste(na.omit(x[2:4]),collapse=' '),
             Amount = paste(na.omit(x[5:7]),collapse =','),
             Total = sum(as.numeric(na.omit(x[5:7]))),
             Count = length(na.omit(x[2:4])))
})
do.call(rbind,res)
                    Fruit  Amount Total Count
1            Banana Apple    4, 3     7     2
2 Strawberry Apple Orange 1, 2, 4     7     3
3           Orange Banana 1, 1, 1     3     2
4              Strawberry       3     3     1
5             Kiwi Banana    1, 2     3     2

I would also change the index number by a grep, something like this

 Fruit  = gregexpr('Fruit[.][0-9]', colnames(dat)) > 0  
 Amount = gregexpr('Amount[.][0-9]', colnames(dat)) > 0 

 x[2:4] replace by x[which(Fruit)]....

EDIT add some benchmarking.

library(microbenchmark)
library(data.table)
microbenchmark(ag(),mn(), am(), tr())
Unit: milliseconds
  expr       min        lq    median        uq       max
1 ag() 11.584522 12.268140 12.671484 13.317934 109.13419
2 am()  9.776206 10.515576 10.798504 11.437938 137.44867
3 mn()  6.470190  6.805646  6.974797  7.290722  48.68571
4 tr()  1.759771  1.929870  2.026960  2.142066   7.06032

For a small data.frame, Tyler Rinker is the winner!! How I explain this (just a guess)

  1. data:table solution suffer from the use of reshape and generally data.table is faster for big data.
  2. Ag study solution is slower because of the subsetting for each row , not like Tyler solution which subset before using apply.
  3. am solution is slow because of the use of reshape and merge..
agstudy
  • 119,832
  • 17
  • 199
  • 261
  • +1 Yes exactly, `reshape` is a non `data.table` function so `mn()` isn't a pure `data.table` solution. – Matt Dowle Mar 06 '13 at 10:56
  • @agstudy don't forget you selected `paste` as well which is slower than `unlist`. – Tyler Rinker Mar 06 '13 at 15:43
  • I don't think you could accomplish putting the vectors into cells with this technique. What you have is a string in a cell which is fairly easy to do but you've lost the ability to easily operate on the vector as a vector of numeric vectors; e.g. you can't do `sapply(Amount, max)` as you now have character vectors. – Tyler Rinker Mar 06 '13 at 16:29
  • @TylerRinker yes I get your point. the use of paste here is a sort of work around. even the OP doesn't precise if he want an atomic vector or a single string. – agstudy Mar 06 '13 at 17:04
5

This is a really bad idea but here it is in base data.frame. It works because data.frame is actually a list of equal length vectors. You can force data.frame to store vectors in cells but it takes some hackery. I'd suggest other formats including Marius's suggestion or a list.

DT <- data.frame(
  BuyerID = c(879,765,123,11,773), 
  Fruit.1 = c('Banana','Strawberry','Orange','Strawberry','Kiwi'),
  Fruit.2 = c('Apple','Apple','Banana',NA,'Banana'),
  Fruit.3 = c( NA, 'Orange',NA,NA,NA),
  Amount.1 = c(4,1,1,3,1), Amount.2 = c(3,2,1,NA,2), Amount.3 = c(NA,4,1,NA,NA),
  stringsAsFactors = FALSE)

DT2 <- DT[, 1, drop=FALSE]
DT2$Fruit <- apply(DT[, 2:4], 1, function(x) unlist(na.omit(x)))
DT2$Amount <- apply(DT[, 5:7], 1, function(x) unlist(na.omit(x)))
DT2$Total <- sapply(DT2$Amount, sum)
DT2$Count <- sapply(DT2$Fruit, length)

Yielding:

> DT2
  BuyerID                     Fruit  Amount Total Count
1     879             Banana, Apple    4, 3     7     2
2     765 Strawberry, Apple, Orange 1, 2, 4     7     3
3     123            Orange, Banana 1, 1, 1     3     2
4      11                Strawberry       3     3     1
5     773              Kiwi, Banana    1, 2     3     2
Tyler Rinker
  • 108,132
  • 65
  • 322
  • 519
  • Possibly but forcing this is delicate. Not sure. – Tyler Rinker Mar 06 '13 at 03:10
  • I would argue that list(1:3,1:3,1:2) is a vector of length 3, so it is ok – mnel Mar 06 '13 at 03:15
  • `length(list(1:3,1:3,1:2))` is 3, so the point about equal length vectors is not valid. I agree that list columns can be awkward to deal with so perhaps I don't really have a point, other than being a pedant! – mnel Mar 06 '13 at 03:34
  • @TylerRinker +1 because my solution is fat cat comparing to yours :) I guess subsetting by row (within the apply) is too slow. – agstudy Mar 06 '13 at 06:04
4

Adding to the already existing great answers, here is another (sticking to base R):

with(DT, {
  # Convert to long format
  DTlong <- reshape(DT, direction = "long", 
                    idvar = "BuyerID", varying = 2:ncol(DT))
  # aggregate your fruit columns 
  # You need the `do.call(data.frame, ...)` to convert
  #   the resulting matrix-as-a-column into separate columns
  Agg1 <- do.call(data.frame, 
                  aggregate(Fruit ~ BuyerID, DTlong,
                            function(x) c(Fruit = paste0(x, collapse = " "),
                                          Count = length(x))))
  # aggregate the amount columns
  Agg2 <- aggregate(Amount ~ BuyerID, DTlong, sum)
  # merge the results
  merge(Agg1, Agg2)
})
#   BuyerID             Fruit.Fruit Fruit.Count Amount
# 1      11              Strawberry           1      3
# 2     123           Orange Banana           2      3
# 3     765 Strawberry Apple Orange           3      7
# 4     773             Kiwi Banana           2      3
# 5     879            Banana Apple           2      7

The basic concept is to:

  1. Use reshape to get your data in long form (which is where I think you should stop, actually)
  2. Use two different aggregate commands, one to aggregate your fruit columns, and one to aggregate your amount columns. The formula approach of aggregate takes care of removing NA, but you can specify your desired behavior with the na.action argument.
  3. Use merge to combine the two.
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
0

It didn't exist when the question was asked, but tidyr works well for this.

Reusing data from @mnel's answer,

library(tidyr)
separator <- ' '
DT %>%
  unite(Fruit, grep("Fruit", names(.)), sep = separator) %>%
  unite(Amount, grep("Amount", names(.)), sep = separator)

#   BuyerID                   Fruit  Amount Total Count
# 1     879         Banana Apple NA  4 3 NA     7     2
# 2     765 Strawberry Apple Orange   1 2 4     7     3
# 3     123        Orange Banana NA   1 1 1     3     2
# 4      11        Strawberry NA NA 3 NA NA     3     1
# 5     773          Kiwi Banana NA  1 2 NA     3     2
jaimedash
  • 2,683
  • 17
  • 30