I am trying to compute per-grid-cell summary statistics using the code below:
library(data.table)
df <- fread("input.xyz", header=F, sep = " ", stringsAsFactors = F)
df2 <- read.table("input2.xyz", header=F, sep = " ", stringsAsFactors = F)
df2 <- df2[df2$V3 != 0, ]  # note: -which(...) would drop every row if no V3 equals 0
long <- df2$V1
lat <- df2$V2
fin_mtx <- matrix(NA, nrow = nrow(df2), ncol = 8)  # df2 may have fewer than 18,976 rows after filtering
colnames(fin_mtx) <- c("Longitude", "Latitude", "Mean", "Median", "Std Dev",
"Max", "Min", "No. of NA")
fin_mtx <- as.data.frame(fin_mtx)
i <- 1
while (i <= nrow(df2))  # "< 18976" skipped the last cell and hard-coded the row count
{
px_vl <- subset(df$V3, (df$V1 > long[i] - 0.125/2) & (df$V1 < long[i] + 0.125/2) &
(df$V2 < lat[i] + 0.125/2) & (df$V2 > lat[i] - 0.125/2))
# Count and recode the -32768 sentinel directly; comparing the factor
# levels returned by table() against a number is fragile.
n_miss <- sum(px_vl == -32768)
if (n_miss > 0) {
fin_mtx[i,8] <- n_miss
px_vl[px_vl == -32768] <- NA
}
fin_mtx[i,1] <- long[i]
fin_mtx[i,2] <- lat[i]
fin_mtx[i,3] <- mean(px_vl, na.rm = T)
fin_mtx[i,4] <- median(px_vl, na.rm = T)
fin_mtx[i,5] <- sd(px_vl, na.rm = T)
fin_mtx[i,6] <- max(px_vl, na.rm = T)
fin_mtx[i,7] <- min(px_vl, na.rm = T)
i = i + 1
}
The df has close to 172 million rows and three columns, whereas df2 has 18,976 rows. Running this code takes a very long time (I mean days) and uses a lot of memory. I want to reduce both the runtime and the computational load. I have gone through suggestions from various tutorials, such as preallocating the result and using data.table, but they aren't helping much.
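For reference, one direction I am considering is replacing the 18,976 `subset()` scans with a single data.table non-equi join followed by one grouped aggregation. The sketch below is untested and runs on small toy stand-ins for the real files; the column names `V1`/`V2`/`V3`, the `-32768` sentinel, and the 0.125-degree window come from the code above, while everything else (the toy data, the `cell` id column, the helper bound columns) is assumed for illustration:

```r
library(data.table)

set.seed(1)
# Toy stand-ins for the real inputs (the real df has ~172 million rows)
df  <- data.table(V1 = runif(5000), V2 = runif(5000),
                  V3 = sample(c(-32768, 1:50), 5000, replace = TRUE))
df2 <- data.table(V1 = runif(20), V2 = runif(20))

half <- 0.125 / 2
# Precompute the window bounds and a cell id for every target point
df2[, `:=`(lon_lo = V1 - half, lon_hi = V1 + half,
           lat_lo = V2 - half, lat_hi = V2 + half,
           cell   = .I)]

# One non-equi join tags each df row with the cell it falls in,
# replacing the per-cell subset() calls over all 172M rows.
matched <- df[df2,
              on = .(V1 > lon_lo, V1 < lon_hi, V2 > lat_lo, V2 < lat_hi),
              .(cell = i.cell, V3 = x.V3), nomatch = NULL]

# Recode the sentinel once, then compute all statistics in one grouped pass
matched[V3 == -32768, V3 := NA]
fin <- matched[, .(Mean        = mean(V3, na.rm = TRUE),
                   Median      = median(V3, na.rm = TRUE),
                   `Std Dev`   = sd(V3, na.rm = TRUE),
                   Max         = max(V3, na.rm = TRUE),
                   Min         = min(V3, na.rm = TRUE),
                   `No. of NA` = sum(is.na(V3))),
               by = cell]

# Attach the cell coordinates back to the results
fin <- df2[, .(cell, Longitude = V1, Latitude = V2)][fin, on = "cell"]
```

One caveat: cells that match no rows at all simply do not appear in `fin` (instead of producing a row of NAs), so a final join of `fin` back onto all of `df2` would be needed to keep empty cells.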