0

I want to get the song that user's play most frequently. The three fields I want in the csv file are userId,songId and playCount but the select function is giving an error:

write.csv(group_by(mydata,userId) %.%
summarise(one=max(playCount)) %.%
select(userId,songId,playCount), file="FavouriteSongs.csv")

Error in eval(expr, envir, enclos) : object 'songId' not found

An example of the data looks like this

userId      songId            playCount
A           568r              85
A           711g              18
C           34n               18
E           454j              65
D           663a              72
B           35d               84
A           34c               72
A           982s              65
E           433f              11
A           565t              7

Thanks in advance

Cormac
  • 39
  • 1
  • 6
  • How does this differ from [your previous question](http://stackoverflow.com/questions/22209706/count-5-highest-values-of-a-variable)? – Thomas Mar 06 '14 at 18:02
  • The previous answer didn't give me 5 highest values for each user – Cormac Mar 06 '14 at 18:23
  • @Thomas My problem was adding `rev(sort(mydata$playCount))[1:5]` into my function `summarise(one=max(playCount))`. – Cormac Mar 06 '14 at 18:37

2 Answers2

2

In your chained sequence of dplyr operations, the summarise call will produce two columns: the grouping variable and the result of the summary function.

df %.%
  group_by(userId) %.%
  summarise(
    one = max(playCount))

# Source: local data frame [5 x 2]
# 
#   userId one
# 1      A  85
# 2      B  84
# 3      C  18
# 4      D  72
# 5      E  65

When you then try to select the songID variable from the data frame generated by summarise, the songID variable is not found.

df %.%
  group_by(userId) %.%
  summarise(
    one = max(playCount)) %.%
  select(userId, songId, playCount)
# Error in eval(expr, envir, enclos) : object 'songId' not found 

A more suitable dplyr function in this case is filter. Here we select rows where the condition playCount == max(playCount) is TRUE within each group.

df %.%
  group_by(userId) %.%
  filter(
    playCount == max(playCount))

# Source: local data frame [5 x 3]
# Groups: userId
# 
#   userId songId playCount
# 1      A   568r        85
# 2      C    34n        18
# 3      E   454j        65
# 4      D   663a        72
# 5      B    35d        84

You find several nice dplyr examples here.

Henrik
  • 65,555
  • 14
  • 143
  • 159
-1

There are several ways to achieve this.

being d your data.frame. retrieve the row with the most played song:

d[d$playCount == max(d$playCount), ]

For most played by user, try this

d <- data.frame(userId = rep(seq(1:5),2) ,    
                songId = letters[1:10],          
                playCount = c(10:19))

> d
   userId songId playCount
1       1      a        10
2       2      b        11
3       3      c        12
4       4      d        13
5       5      e        14
6       1      f        15
7       2      g        16
8       3      h        17
9       4      i        18
10      5      j        19


d2<- d[order(-d$playCount), ]
dout <- d2[!duplicated(d2$userId), ]

> dout
   userId songId playCount
10      5      j        19
9       4      i        18
8       3      h        17
7       2      g        16
6       1      f        15

The approach is correct and is fast, almost as fast as dplyr. Try it with a 1000000 rows data frame

df <- data.frame(userId = rep(seq(1:5),100000) ,    
                songId = rep(letters[1:10], 100000),          
                playCount = runif(1000000,10,20))

using @Henrik dplyr approach

system.time(df %.%
  group_by(userId) %.%
  filter(
    playCount == max(playCount)))
Source: local data frame [5 x 3]

Groups: userId

  userId songId playCount
1      2      b  19.99995
2      5      j  19.99982
3      1      f  19.99981
4      4      d  19.99995
5      3      h  19.99999

user  system elapsed 
0.08    0.02    0.09 

and using Hadley approach

df2<- df[order(-df$playCount), ]
dout <- df2[!duplicated(df2$userId), ]
> dout
       userId songId playCount
671528      3      h  19.99999
466824      4      d  19.99995
185512      2      b  19.99995
249190      5      j  19.99982
455746      1      f  19.99981

system.time(dout <- df2[!duplicated(df2$userId), ])

user  system elapsed 
0.13    0.00    0.12 
miken32
  • 42,008
  • 16
  • 111
  • 154
Paulo E. Cardoso
  • 5,778
  • 32
  • 42