
I have a combined dataframe that contains elevation data from SRTM for the US. However, the total size of my dataframe is 115,200,000 rows. Plotting this amount of data gives the error "Error: vector memory exhausted (limit reached?)", so I want to resize it. Here is a copy of my dataframe:

structure(list(X = c(-139.995833333333, -139.9875, -139.979166666667, 
-139.970833333333, -139.9625, -139.954166666667, -139.945833333333, 
-139.9375, -139.929166666667, -139.920833333333, -139.9125, -139.904166666667, 
-139.895833333333, -139.8875, -139.879166666667, -139.870833333333, 
-139.8625, -139.854166666667, -139.845833333333, -139.8375), 
    Y = c(89.9958333333333, 89.9958333333333, 89.9958333333333, 
    89.9958333333333, 89.9958333333333, 89.9958333333333, 89.9958333333333, 
    89.9958333333333, 89.9958333333333, 89.9958333333333, 89.9958333333333, 
    89.9958333333333, 89.9958333333333, 89.9958333333333, 89.9958333333333, 
    89.9958333333333, 89.9958333333333, 89.9958333333333, 89.9958333333333, 
    89.9958333333333), Elevation = c(0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 20L
), class = "data.frame")

These are 20 rows. How can I, for example, resize it so that it has only 4 rows, each of which is the average of 5 consecutive rows?

Hope you can help me out!

Roelalex1996
  • If your data are _data_, why do you want to change them? More standard practice is to plot a random sample of `npts <- 10000` points: `plot(df[sample(nrow(df), npts), ])`. There are also many tools for plotting distributions of points; `smoothScatter` works like `plot` if you don't want to use ggplot2-style approaches, as in https://stackoverflow.com/a/7714834/2541138. – PeterK Dec 17 '20 at 12:08
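
A minimal sketch of that sampling suggestion (with df standing in for the full dataframe):

npts <- 10000                          # number of points to plot
idx  <- sample(nrow(df), npts)         # random row indices
plot(df$X[idx], df$Y[idx], pch = ".")  # scatter of a random subset

# or, for a density-style view of all points without ggplot2
smoothScatter(df$X, df$Y)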

3 Answers


You can use `aggregate`, building the grouping vector by integer division of the row index:

aggregate(x, list(seq(0, length.out = nrow(x)) %/% 5), FUN = mean)
#  Group.1         X        Y Elevation
#1       0 -139.9792 89.99583         0
#2       1 -139.9375 89.99583         0
#3       2 -139.8958 89.99583         0
#4       3 -139.8542 89.99583         0
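
The same idea with the block size as a variable (x standing for the full dataframe and n for the number of rows per block; both names are placeholders):

n   <- 5                                             # rows per block
grp <- (seq_len(nrow(x)) - 1) %/% n                  # 0 0 0 0 0 1 1 1 1 1 ...
res <- aggregate(x, by = list(grp), FUN = mean)[-1]  # drop the Group.1 column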
GKi

You can try this approach:

library(dplyr)

n <- 5

df %>%
  group_by(grp = ceiling(row_number()/n)) %>%
  summarise(across(c(X, Y), first), 
            Elevation = mean(Elevation, na.rm = TRUE)) %>%
  select(-grp) -> result

result

For every 5 rows this keeps the first value of X and Y and the mean value of Elevation.
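
If X and Y should be averaged per block as well (the question asks for the average of every 5 rows), a possible variant of the same pipeline:

df %>%
  group_by(grp = ceiling(row_number()/n)) %>%
  summarise(across(c(X, Y, Elevation), ~ mean(.x, na.rm = TRUE)), .groups = 'drop') %>%
  select(-grp)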


Since you have a large dataset, using data.table would be beneficial:

library(data.table)

setDT(df)[, .(Elevation = mean(Elevation, na.rm = TRUE), 
              X = first(X), 
              Y = first(Y)), ceiling(seq_len(nrow(df))/n)]
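
And if all three columns should be averaged, a possible data.table version (n again being the block size):

setDT(df)[, lapply(.SD, mean, na.rm = TRUE),
          by = .(grp = ceiling(seq_len(nrow(df))/n))]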
Ronak Shah

We can use `gl` from base R to create the grouping column:

library(dplyr)
n <- 5
df1 %>%
    group_by(grp = as.integer(gl(n(), n, n()))) %>% 
    summarise(across(c(X, Y), first), 
         Elevation = mean(Elevation, na.rm = TRUE), .groups = 'drop') %>% 
    select(-grp)
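
For the 20-row example the grouping index evaluates to blocks of five:

as.integer(gl(20, 5, 20))
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4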
akrun