0

Data:

Hi Everyone,

I have a data of size close to 4GB's having columns as "UserID, MediaID,Full/Mini. I want to know how many full and mini episodes each user has watched. Basically each row having No. of Full and Mini Episodes watched by each user. Also let me know how can we do it more efficiently since the data size is huge and it will slow down the processing.

Thanks. Any help will be highly appreciated.

dat=data.frame(id=c("a","a","a","b","c"), media_id=c("1a","1b","1c","2b","2c"), Full_mini=c("ful","ful","mini","mini","full")) id=c("a","a","a","b","c") 
Sotos
  • 51,121
  • 6
  • 32
  • 66
Ashish
  • 41
  • 7
  • Any help from anyone – Ashish Jun 30 '17 at 13:11
  • 1
    Could you please provide some lines of your data? Full and Mini are different columns?. Probably `aggregate` will do what you want. – S Rivero Jun 30 '17 at 13:14
  • @SRivero: Since I am a new user I am not allowed to post pic of data untill I gain some badges but still you can click "Data " written above the description of the question and have a look at the data. – Ashish Jun 30 '17 at 13:20
  • https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – S Rivero Jun 30 '17 at 13:22
  • Rather than post a picture, a good example of what to do is to give us a few lines of code to come up with a sample object. For example, if your data is in a data frame, you could construct a data frame of 10-20 rows with the kind of duplicated data you care about. That saves us a lot of time! – Joy Jun 30 '17 at 13:22
  • @Joy: dat=data.frame(id=c("a","a","a","b","c"), media_id=c("1a","1b","1c","2b","2c"), Full_mini=c("ful","ful","mini","mini","full")) id=c("a","a","a","b","c") – Ashish Jun 30 '17 at 13:46
  • What is your expected output for the example you just shared? – Sotos Jun 30 '17 at 14:09

1 Answers1

0

You can do it with table.

subset(as.data.frame(table(dat[-2])),Freq>0)
#   id Full_mini Freq
# 1  a       ful    2
# 6  c      full    1
# 7  a      mini    1
# 8  b      mini    1

Typos are yours!

If it's still too slow, split it in in 2, that's a fortunate thing of your dataset that you have only two possible values for last col. then you'll have 2 smaller datasets on which you'll apply a count on only one col, that should be fast.

dat_full <- subset(dat,Full_mini == "full" | Full_mini == "ful") 
dat_mini <- subset(dat,Full_mini == "mini")
library(magrittr)
res_full <- dat_full$id %>% 
  table %>% 
  as.data.frame %>%
  subset(Freq>0) %>%
  transform(Full_mini = "full") %>%
  setNames(c("id","Freq","Full_mini"))

res_mini <- dat_mini$id %>% 
  table %>% 
  as.data.frame %>%
  subset(Freq>0) %>%
  transform(Full_mini = "mini") %>%
  setNames(c("id","Freq","Full_mini"))

res <- rbind(res_full,res_mini)

Or side by side:

res_full <- dat_full$id %>% 
  table %>% 
  as.data.frame

res_mini <- dat_mini$id %>% 
  table %>% 
  as.data.frame

res <- setNames(cbind(res_full[1:2],res_mini[2]),c("id","full","mini"))

  id full Freq
1  a    2    1
2  b    0    1
3  c    1    0
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167