I'm fairly new to R. I have a big data frame called MegaFrame (11,000,000 rows). I want to build another dataset containing the mean of MegaFrame$value for each combination of session and P_CODE. This produces a lot of NAs, because many P_CODE-session pairs don't exist in the frame. I found a solution that I think works, but it has now been running for 12 hours and still hasn't finished.

colClasses = c("integer", "factor", "integer")
col.names = c("MeanMesure", "P_CODE", "session")

# Start from an empty data frame with the target columns
MeanFrame <- data.frame(MeanMesure = numeric(0), P_CODE = character(0), session = integer(0))

sessions <- unique(MegaFrame$session)
codes <- levels(MegaFrame$P_CODE)

for (i in seq_along(sessions)) {
  for (j in seq_along(codes)) {
    # Mean of value for one session / P_CODE pair (element-wise &, not &&)
    x <- mean(MegaFrame$value[MegaFrame$session == sessions[i] & MegaFrame$P_CODE == codes[j]])
    df <- data.frame(x, codes[j], sessions[i])
    colnames(df) <- col.names
    # Append one row per iteration
    MeanFrame <- rbind(MeanFrame, df)
  }
}

I know I can add a condition so that the NA values are not added to the data frame. But I feel my method is too heavy for what I want to do (every iteration builds a data frame, renames its columns, then calls rbind), and I don't know how to make it lighter. I have already had a lot of trouble adding rows to the data frame.

Does anybody have ideas for this?

Konrad Rudolph

1 Answer

Based on your problem description, I don't think there is any need for a for loop. You can try:

library(tidyverse)

# One row per existing P_CODE/session pair, holding the mean of value
MeanFrame <- MegaFrame %>%
    group_by(P_CODE, session) %>%
    summarise(mean.value = mean(value))

You might have to use mean(value, na.rm = TRUE) instead of mean(value) to deal with NAs in value.
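As a rough illustration of the grouped summary, here is a minimal sketch on made-up data. The `toy` data frame and its values are hypothetical stand-ins for MegaFrame, and the `.groups = "drop"` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy stand-in for MegaFrame (not the asker's data)
toy <- data.frame(
  P_CODE  = factor(c("A", "A", "B", "B", "B")),
  session = c(1L, 1L, 1L, 2L, 2L),
  value   = c(10, 20, 5, NA, 7)
)

# One row per P_CODE/session pair that actually occurs in the data
toy_means <- toy %>%
  group_by(P_CODE, session) %>%
  summarise(mean.value = mean(value, na.rm = TRUE), .groups = "drop")

print(toy_means)
```

The pair (A, 2) never occurs in `toy`, so it simply does not appear in the result, and no filtering of empty combinations is needed afterwards.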

Your code is so slow because you're growing MeanFrame dynamically, adding row after row: each rbind call copies the whole data frame built so far. That's about as inefficient as it gets, and it can and generally should be avoided. If you must use a for loop, pre-allocating a data.frame of the correct dimensions and filling it in would speed things up.
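If a loop is unavoidable, pre-allocation could look roughly like the sketch below. It uses only base R; the `toy` data frame, the `result` and `means` names, and the choice of `expand.grid` to enumerate the combinations are assumptions made for illustration, not part of the original code.

```r
# Hypothetical toy stand-in for MegaFrame
toy <- data.frame(
  P_CODE  = factor(c("A", "A", "B", "B", "B")),
  session = c(1L, 1L, 1L, 2L, 2L),
  value   = c(10, 20, 5, 6, 8)
)

# One row per P_CODE/session combination, allocated once up front
result <- expand.grid(
  P_CODE  = levels(toy$P_CODE),
  session = unique(toy$session),
  stringsAsFactors = FALSE
)

# Pre-allocated vector of means; NA marks combinations with no data
means <- rep(NA_real_, nrow(result))

for (k in seq_len(nrow(result))) {
  rows <- toy$session == result$session[k] & toy$P_CODE == result$P_CODE[k]
  if (any(rows)) {
    means[k] <- mean(toy$value[rows])
  }
}

result$MeanMesure <- means

# Drop the combinations that never occur in the data
result <- result[!is.na(result$MeanMesure), ]
print(result)
```

This avoids the repeated copying caused by rbind, but each iteration still scans the whole data frame, so on 11 million rows the grouped summary is likely to remain much faster.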

On a side note: It's advised to always provide a minimal reproducible example with sample data.

Maurits Evers
  • Thanks a lot, the difference in efficiency between the two solutions is impressive: yours took 2 seconds! Next time I'll include a reproducible example :-) – Matthias Gorremans Apr 12 '18 at 09:44