
I have a large data frame (>1,000,000 entries) with one column containing a date/time variable and one column containing a numeric value. The problem is that some date/time values occur two or three times, and the corresponding numeric values need to be averaged so that I end up with one numeric value per date/time value.

Until now, I have been doing the following:

## audio_together is the data frame with two columns: $timestamp and $amplitude
## (i.e. the numeric value)

timestamp_unique <- unique(audio_together$timestamp)  ## find all unique timestamps
## audio_together3 is the new vector containing the averaged value for each timestamp
audio_together3 <- rep(NA, length(timestamp_unique))
count <- 0
for (k in 1:length(timestamp_unique)){
  temp_time <- timestamp_unique[k]
  if (k == 1){
    ## look for the timestamp only in a small subset, which definitely contains it
    temp_subset <- audio_together[1:10, ]
    temp_data_which <- which(temp_subset$timestamp == temp_time)
  } else {
    temp_subset <- audio_together[count:(count + 9), ]
    temp_data_which <- which(temp_subset$timestamp == temp_time)
  }
  if (length(temp_data_which) > 1){
    audio_together3[k] <- mean(temp_subset$amplitude[temp_data_which], na.rm = TRUE)
  } else {
    audio_together3[k] <- temp_subset$amplitude[temp_data_which]
  }
  count <- count + length(temp_data_which)
}

However, this process is still quite slow. Any ideas to SIGNIFICANTLY speed it up (i.e. down to a runtime in the range of minutes)?

UPDATE: Example

timestamp <- c("2015-09-03 18:54:13", "2015-09-03 18:54:14", "2015-09-03 18:54:14",
               "2015-09-03 18:54:15", "2015-09-03 18:54:15", "2015-09-03 18:54:16",
               "2015-09-03 18:54:16", "2015-09-03 18:54:17", "2015-09-03 18:54:17")
amplitude <- c(200, 313, 321, 432, 111, 423, 431, 112, 421)

audio_together <- data.frame(timestamp, amplitude)
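For this sample, the expected output is one averaged amplitude per unique timestamp. As a sketch of that expected output, base R's `aggregate` produces it directly (`expected` is just an illustrative name):

```r
timestamp <- c("2015-09-03 18:54:13", "2015-09-03 18:54:14", "2015-09-03 18:54:14",
               "2015-09-03 18:54:15", "2015-09-03 18:54:15", "2015-09-03 18:54:16",
               "2015-09-03 18:54:16", "2015-09-03 18:54:17", "2015-09-03 18:54:17")
amplitude <- c(200, 313, 321, 432, 111, 423, 431, 112, 421)
audio_together <- data.frame(timestamp, amplitude)

## one row per unique timestamp, duplicates averaged
expected <- aggregate(amplitude ~ timestamp, data = audio_together, FUN = mean)
expected
#             timestamp amplitude
# 1 2015-09-03 18:54:13     200.0
# 2 2015-09-03 18:54:14     317.0
# 3 2015-09-03 18:54:15     271.5
# 4 2015-09-03 18:54:16     427.0
# 5 2015-09-03 18:54:17     266.5
```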
Christine Blume
  • Can you provide a small representative sample of your data and the expected output? Groupings like the one you want are handled in many ways: `tapply`, `ave` and `aggregate` in base R. The `data.table` and `dplyr` packages will very likely provide the needed speed. – nicola Apr 05 '16 at 10:27
    `library(data.table); setDT(audio_together); audio_together[, .(amplitude = mean(amplitude, na.rm = TRUE)), by = timestamp]` – Roland Apr 05 '16 at 10:27
    Did you check [this](http://stackoverflow.com/questions/21982987/mean-per-group-in-a-data-frame)? – David Arenburg Apr 05 '16 at 10:29
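For concreteness, Roland's `data.table` one-liner from the comments, spelled out against the example data in the question (`result` is just an illustrative name):

```r
library(data.table)

timestamp <- c("2015-09-03 18:54:13", "2015-09-03 18:54:14", "2015-09-03 18:54:14",
               "2015-09-03 18:54:15", "2015-09-03 18:54:15", "2015-09-03 18:54:16",
               "2015-09-03 18:54:16", "2015-09-03 18:54:17", "2015-09-03 18:54:17")
amplitude <- c(200, 313, 321, 432, 111, 423, 431, 112, 421)
audio_together <- data.frame(timestamp, amplitude)

setDT(audio_together)  ## convert to data.table by reference, without copying
result <- audio_together[, .(amplitude = mean(amplitude, na.rm = TRUE)), by = timestamp]
```

`by =` keeps the groups in order of first appearance; use `keyby =` instead if you want the result sorted by timestamp.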

2 Answers


It's hard to test without a minimal reproducible example, but if your intention is to average all amplitudes sharing the same timestamp, then this dplyr solution may help:

library(dplyr)
audio_together %>%
  group_by(timestamp) %>%
  summarize(av_amplitude = mean(amplitude, na.rm = TRUE)) %>%
  ungroup()
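
Applied to the example data from the question, the pipeline gives one row per unique timestamp (a quick self-contained check; `result` is just an illustrative name):

```r
library(dplyr)

timestamp <- c("2015-09-03 18:54:13", "2015-09-03 18:54:14", "2015-09-03 18:54:14",
               "2015-09-03 18:54:15", "2015-09-03 18:54:15", "2015-09-03 18:54:16",
               "2015-09-03 18:54:16", "2015-09-03 18:54:17", "2015-09-03 18:54:17")
amplitude <- c(200, 313, 321, 432, 111, 423, 431, 112, 421)
audio_together <- data.frame(timestamp, amplitude)

result <- audio_together %>%
  group_by(timestamp) %>%
  summarize(av_amplitude = mean(amplitude, na.rm = TRUE)) %>%
  ungroup()
## e.g. the two rows at 18:54:14 (313 and 321) collapse to a single averaged row of 317
```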
Vincent Bonhomme

Thanks for your ideas.

The following works perfectly:

library(dplyr)
audio_together <- audio_together %>%
  group_by(timestamp) %>%
  summarise(amplitude = mean(amplitude, na.rm = TRUE)) %>%
  ungroup()
Christine Blume