
I have a dataset containing a million observations, and I'm taking the first 10,000 observations from it. Here is a link to the dataset file: dataset file link

itemRatingData = itemRatingData[1:10000,]
# V2 is the user ID, V1 is the item ID, V3 is the item rating from the user

library(plyr)
countUser = count(itemRatingData, vars = "V2")
# count the total observations per user in the dataset

list_of_total_Users = as.list(countUser$V2)
# collect the distinct user IDs as a list

Next, I want to extract the observations of users who have rated at least 10 items, and I managed to do that. Some of these users have rated 50, 100, or even 1000+ items, but I only need 10 observations from each user who has rated 10 or more items. This is what I tried in order to get the desired result:

for (i in 1:length(list_of_total_Users)) {
    occurencePerID = subset(itemRatingData, 
    itemRatingData$V2%in%list_of_total_Users[[i]])

    countOccurencePerID = count(occurencePerID, vars = "V2")
    if(countOccurencePerID$freq >= 10){
       newItemRatingData = occurencePerID[1:10,]
    }
}

In this code I subset all observations for each user ID and then count them. If the frequency for that user ID is >= 10, I extract the first 10 observations. The problem I'm facing is that every time the loop iterates, it overwrites newItemRatingData.

  • Welcome to SO! Please do not use images to supply data. Please read https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example and the first part of https://stackoverflow.com/tags/r/info – jogo Aug 03 '18 at 12:13
  • added link to dataset file – Saad ur Rehman Aug 06 '18 at 06:08

2 Answers


Even though I can't reproduce your issue without the data, it seems you are replacing the contents of newItemRatingData on every iteration. If you use cbind(), you can append your rows to newItemRatingData without replacing what's already there:

newItemRatingData = data.frame()
for (i in 1:length(list_of_total_Users)) {
    occurencePerID = subset(itemRatingData, 
    itemRatingData$V2%in%list_of_total_Users[[i]])

    countOccurencePerID = count(occurencePerID, vars = "V2")
    if(countOccurencePerID$freq >= 10){
       newItemRatingData = cbind(newItemRatingData,occurencePerID[1:10,])
    }
}
Fino
  • I'm getting this error: "Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 0, 10". I have also added the dataset file; here is the link: https://drive.google.com/open?id=12UY1SvDvbu7sFfeM6qFwuzjWDE33IGGr – Saad ur Rehman Aug 06 '18 at 06:06
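
The error quoted in the comment above can likely be reproduced in isolation. The following is a minimal sketch, not the asker's actual data: `chunk` and `empty` are hypothetical stand-ins for `occurencePerID[1:10,]` and the initial `data.frame()`. It contrasts how `cbind()` and `rbind()` treat an empty data frame.

```r
# Hypothetical toy data standing in for occurencePerID[1:10, ]
chunk <- data.frame(V1 = 1:10, V2 = rep(7, 10), V3 = rep(5L, 10))
empty <- data.frame()

# cbind() places data frames side by side, so their row counts must agree;
# 0 rows against 10 rows raises an error, which is captured here as a message
cbindResult <- tryCatch(
  cbind(empty, chunk),
  error = function(e) conditionMessage(e)
)

# rbind() stacks rows, so an empty data frame is a valid starting point
stacked <- rbind(empty, chunk)
stacked <- rbind(stacked, chunk)
```

Here `cbindResult` holds the error message rather than a data frame, while `stacked` grows by 10 rows per call.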

I have resolved my issue. The solution is:

# Start with an empty data frame with columns V2, V1 and V3
newItemRatingData = data.frame("V2" = numeric(0), "V1" = numeric(0), "V3" = integer(0))

for (i in 1:length(list_of_total_Users)) {
  # All observations for the current user
  occurencePerID = subset(itemRatingData, itemRatingData$V2%in%list_of_total_Users[[i]])

  countOccurencePerID = count(occurencePerID, vars = "V2")
  if(countOccurencePerID$freq >= 10){
    # Append this user's first 10 observations as new rows
    newItemRatingData = rbind(newItemRatingData, occurencePerID[1:10,])
  }
}

As for @Fino's answer: it binds the data frames column-wise, whereas the solution I found binds them row-wise.
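
For larger inputs, growing a data frame with `rbind()` inside a loop can get slow. A loop-free alternative in base R is sketched below. This is a different technique from the loop above, not the original solution, and the toy `itemRatingData` is hypothetical: it only mirrors the V1/V2/V3 column layout described in the question. `rankInUser` and `rowsPerUser` are names introduced for illustration.

```r
# Hypothetical toy data: user 1 has 12 ratings, user 2 has 3, user 3 has 10
itemRatingData <- data.frame(
  V1 = 1:25,
  V2 = c(rep(1, 12), rep(2, 3), rep(3, 10)),
  V3 = rep(4L, 25)
)

rowIndex <- seq_len(nrow(itemRatingData))

# Position of each row within its own user's rows (1, 2, 3, ...)
rankInUser <- ave(rowIndex, itemRatingData$V2, FUN = seq_along)

# Total number of rows belonging to each row's user
rowsPerUser <- ave(rowIndex, itemRatingData$V2, FUN = length)

# Keep the first 10 rows of every user who has at least 10 rows
newItemRatingData <- itemRatingData[rowsPerUser >= 10 & rankInUser <= 10, ]
```

With this toy data, user 2 is dropped entirely and users 1 and 3 each contribute exactly 10 rows.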