
I have a combined dataframe that contains elevation data from SRTM for the US. However, the total size of my dataframe is 115,200,000 rows. Plotting this amount of data gives the error "Error: vector memory exhausted (limit reached?)", so I want to resize it. Here is a copy of my dataframe:

structure(list(X = c(-139.995833333333, -139.9875, -139.979166666667, 
-139.970833333333, -139.9625, -139.954166666667, -139.945833333333, 
-139.9375, -139.929166666667, -139.920833333333, -139.9125, -139.904166666667, 
-139.895833333333, -139.8875, -139.879166666667, -139.870833333333, 
-139.8625, -139.854166666667, -139.845833333333, -139.8375), 
    Y = c(89.9958333333333, 89.9958333333333, 89.9958333333333, 
    89.9958333333333, 89.9958333333333, 89.9958333333333, 89.9958333333333, 
    89.9958333333333, 89.9958333333333, 89.9958333333333, 89.9958333333333, 
    89.9958333333333, 89.9958333333333, 89.9958333333333, 89.9958333333333, 
    89.9958333333333, 89.9958333333333, 89.9958333333333, 89.9958333333333, 
    89.9958333333333), Elevation = c(0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 20L
), class = "data.frame")

These are 20 rows. How can I, for example, resize it so that it has only 4 rows, each of which is the average of 5 consecutive rows?

Hope you can help me out!

Roelalex1996
  • If your data are _data_, why do you want to change them? More standard practice is to plot a random sample of `npts <- 10000` points: `plot(df[sample(nrow(df), npts), ])`. There are also many tools for plotting distributions of points; `smoothScatter` works like `plot` if you don't want to use ggplot2-style approaches, as in https://stackoverflow.com/a/7714834/2541138. – PeterK Dec 17 '20 at 12:08
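
A minimal sketch of that sampling suggestion (with df standing in for the full dataframe):

npts <- 10000                          # number of points to plot
idx  <- sample(nrow(df), npts)         # random row indices
plot(df$X[idx], df$Y[idx], pch = ".")  # scatter of a random subset

# or, for a density-style view of all points without ggplot2
smoothScatter(df$X, df$Y)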

3 Answers


You can use `aggregate`, building the grouping vector by integer division of the row index:

aggregate(x, list(seq(0, length.out = nrow(x)) %/% 5), FUN = mean)
#  Group.1         X        Y Elevation
#1       0 -139.9792 89.99583         0
#2       1 -139.9375 89.99583         0
#3       2 -139.8958 89.99583         0
#4       3 -139.8542 89.99583         0
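
The same idea with the block size as a variable (x standing for the full dataframe and n for the number of rows per block; both names are placeholders):

n   <- 5                                             # rows per block
grp <- (seq_len(nrow(x)) - 1) %/% n                  # 0 0 0 0 0 1 1 1 1 1 ...
res <- aggregate(x, by = list(grp), FUN = mean)[-1]  # drop the Group.1 column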
GKi

You can try this approach:

library(dplyr)

n <- 5

df %>%
  group_by(grp = ceiling(row_number()/n)) %>%
  summarise(across(c(X, Y), first), 
            Elevation = mean(Elevation, na.rm = TRUE)) %>%
  select(-grp) -> result

result

For every 5 rows this keeps the first value of X and Y and the mean value of Elevation.
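
If X and Y should be averaged per block as well (the question asks for the average of every 5 rows), a possible variant of the same pipeline:

df %>%
  group_by(grp = ceiling(row_number()/n)) %>%
  summarise(across(c(X, Y, Elevation), ~ mean(.x, na.rm = TRUE)), .groups = 'drop') %>%
  select(-grp)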


Since you have a large dataset, using data.table would be beneficial:

library(data.table)

setDT(df)[, .(Elevation = mean(Elevation, na.rm = TRUE), 
              X = first(X), 
              Y = first(Y)), ceiling(seq_len(nrow(df))/n)]
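
And if all three columns should be averaged, a possible data.table version (n again being the block size):

setDT(df)[, lapply(.SD, mean, na.rm = TRUE),
          by = .(grp = ceiling(seq_len(nrow(df))/n))]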
Ronak Shah

We can use `gl` from base R to create the grouping column:

library(dplyr)
n <- 5
df1 %>%
    group_by(grp = as.integer(gl(n(), n, n()))) %>% 
    summarise(across(c(X, Y), first), 
         Elevation = mean(Elevation, na.rm = TRUE), .groups = 'drop') %>% 
    select(-grp)
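
For the 20-row example the grouping index evaluates to blocks of five:

as.integer(gl(20, 5, 20))
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4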
akrun