
I am trying to get some statistical information using the code below:

library(data.table)

df <- fread("input.xyz", header=F, sep = " ", stringsAsFactors = F)
df2 <- read.table("input2.xyz", header=F, sep = " ", stringsAsFactors = F)

df2 <- df2[-which(df2$V3 == 0),]

long <- df2$V1
lat <- df2$V2
fin_mtx <- matrix(NA, nrow=18976, ncol=8)
colnames(fin_mtx) <- c("Longitude", "Latitude", "Mean", "Median", "Std Dev",
                       "Max", "Min", "No. of NA")
fin_mtx <- as.data.frame(fin_mtx)

i = 1
while (i <= 18976)
{
  px_vl <- subset(df$V3, (df$V1 > long[i] - 0.125/2) & (df$V1 < long[i] + 0.125/2) & 
                         (df$V2 < lat[i] + 0.125/2) & (df$V2 > lat[i] - 0.125/2))
  frq <- as.data.frame(table(px_vl))

  if (frq[1,1] == -32768) {
     fin_mtx[i,8] <- frq[which(frq$px_vl==-32768),2]
     px_vl[px_vl == -32768] <- NA
  }

  fin_mtx[i,1] <- long[i]
  fin_mtx[i,2] <- lat[i]
  fin_mtx[i,3] <- mean(px_vl, na.rm = T)
  fin_mtx[i,4] <- median(px_vl, na.rm = T)
  fin_mtx[i,5] <- sd(px_vl, na.rm = T)
  fin_mtx[i,6] <- max(px_vl, na.rm = T)
  fin_mtx[i,7] <- min(px_vl, na.rm = T)
  i = i + 1
}

df has close to 172 million rows and three columns, whereas df2 has 18,976 rows. Running the code takes a very long time (days), and a lot of memory is used. I want to reduce the run time and the computational load. I have gone through suggestions in different tutorials, such as pre-allocating the vector and using data.table, but they aren't helping much.

  • Please share a few rows of either dataset. I bet this innocent line is the bottleneck: `frq <- as.data.frame(table(px_vl))`: at 172 million rows, that builds a frequency table in every iteration! Epic. Just `table(head(mtcars))` creates 65,333 matrix slices! And binding that as a `data.frame` returns a memory error. – Parfait Jul 05 '17 at 17:33
  • Also please describe in words what your code is doing. It looks like for a 1/8-degree grid you are calculating summary statistics of `px_vl`, but I'm not sure what your `if (frq[1,1] == -32768)` stuff is doing. – Gregor Thomas Jul 05 '17 at 17:41
  • @Parfait `px_vl <- subset(df$V3, (df$V1 > long[i] - 0.125/2) & (df$V1 < long[i] + 0.125/2) & (df$V2 < lat[i] + 0.125/2) & (df$V2 > lat[i] - 0.125/2))` is taking the most time: more than 25 sec when I ran it for i = 1. – Kuljeet Keshav Jul 05 '17 at 17:45
  • @Gregor I am trying to calculate the statistics of a higher-resolution grid to convert it to a 1/8-degree one. -32768 is the value for NA data in df, which I need to count. – Kuljeet Keshav Jul 05 '17 at 17:47
  • How many columns does either dataset hold, and how many unique values? Please share example data for a reproducible example. – Parfait Jul 05 '17 at 17:47
  • You should be using grouped data.table operations (a sketch follows these comments). Right now, for every iteration you are calculating the rounded grid to find the subset, doing an expensive data frame conversion, and then calculating your statistics. You need to add the grouping columns once at the beginning, maybe replace `-32768` with `NA` once at the beginning, and then use data.table `.SD`. As others have said, share some small example data and we can help. Preferably share code to simulate about 100 rows of data with the right structure. – Gregor Thomas Jul 05 '17 at 18:05
  • Isn't subset faster on data.tables? – moodymudskipper Jul 05 '17 at 18:36
  • 6 times faster here: https://stackoverflow.com/questions/27303534/faster-way-to-subset-on-rows-of-a-data-frame-in-r – moodymudskipper Jul 05 '17 at 20:05
  • df$V1 and df$V2 should be assigned a range value before the subset, outside of the loop; you're computing ~20,000 times what should be done once. – moodymudskipper Jul 05 '17 at 20:09
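A minimal sketch of the grouped data.table approach Gregor describes above. It assumes each df2 point is the centre of a non-overlapping 1/8-degree cell and that the cell centres fall on multiples of 0.125, so every row of df can be snapped to its cell once and the statistics computed in a single grouped pass; adjust the snapping if your grid is offset.

library(data.table)
setDT(df)                                   # df from fread is already a data.table, so this is a no-op
df[V3 == -32768, V3 := NA]                  # recode the missing-value sentinel once
df[, `:=`(Longitude = round(V1 / 0.125) * 0.125,   # snap each point to its cell centre
          Latitude  = round(V2 / 0.125) * 0.125)]
res <- df[, .(Mean        = mean(V3, na.rm = TRUE),
              Median      = median(V3, na.rm = TRUE),
              `Std Dev`   = sd(V3, na.rm = TRUE),
              Max         = max(V3, na.rm = TRUE),
              Min         = min(V3, na.rm = TRUE),
              `No. of NA` = sum(is.na(V3))),
          by = .(Longitude, Latitude)]

This replaces 18,976 scans of the 172-million-row table with one grouped pass, which is where the bulk of the saving would come from.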

2 Answers


Try calculating longHigh <- long + 0.125/2 and longLow <- long - 0.125/2, and the same for latHigh and latLow, outside of the loop, since those are fixed calculations and inside the loop you only need to index into each vector with i.
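Concretely, that precomputation (using the question's 0.125-degree cell size) would look something like this:

longLow  <- long - 0.125/2
longHigh <- long + 0.125/2
latLow   <- lat - 0.125/2
latHigh  <- lat + 0.125/2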

That way you can reduce

 px_vl <- subset(df$V3, (df$V1 > long[i] - 0.125/2) & (df$V1 < long[i] + 0.125/2) & 
                         (df$V2 < lat[i] + 0.125/2) & (df$V2 > lat[i] - 0.125/2))

to

px_vl <- subset(df$V3, (df$V1 > longLow[i]) & (df$V1 < longHigh[i]) &
                        (df$V2 < latHigh[i]) & (df$V2 > latLow[i]))

That removes four calculations from each iteration of the loop.

Also, I think you can simplify

 if (frq[1,1] == -32768) {
     fin_mtx[i,8] <- frq[which(frq$px_vl==-32768),2]
     px_vl[px_vl == -32768] <- NA
  }

by adding the na.strings argument to fread(..., na.strings = "-32768"), which lets you at least skip assigning NAs with px_vl[px_vl == -32768] <- NA.
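A sketch of how that could look; the NA count stored in column 8 then becomes a simple sum(is.na(...)) instead of the table() lookup:

df <- fread("input.xyz", header = FALSE, sep = " ", na.strings = "-32768")

# inside the loop, the whole if (frq[1,1] == -32768) block reduces to:
fin_mtx[i, 8] <- sum(is.na(px_vl))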

Mako212

I spent some time thinking about this question and came up with some improvements:

1) As you did not provide example data, I created some myself:

n1 <- 1.72e8
n2 <- 19000

set.seed(21)
df <- data.frame(V1 = rnorm(n1), V2 = rnorm(n1), V3 = rnorm(n1))
df2 <- data.frame(V1 = rnorm(n2), V2 = rnorm(n2))
df$V3[seq(10, n1, 100)] <- 0 # let's assume 0 is the missing value

2) In my testing I saw that working with plain vectors is more efficient than a data.frame or data.table, so we coerce the necessary columns to vectors:

long <- df2$V1
lat <- df2$V2
x3 <- df$V3
x2 <- df$V2
x1 <- df$V1
rm(df) # remove the large dataset from memory
gc()

3) Now we can find the missing value (in your case -32768) and replace it with NA:

x3[x3 == 0] <- NA

4) It looks like using the summary function gives some speed improvement for calculating nearly all of your desired statistics, so we will use it:

rez2 <- matrix(NA, nrow = n2, ncol = 10)
colnames(rez2) <- c("Longitude", "Latitude",
                   names(summary(c(1, NA))), "Std Dev")


i <- 1
k <- 1 # half-width of the grid cell (0.125/2 for your data)

5) These calculations probably do not impact the speed of the loop much, but it is cleaner to do them outside the loop:

lokn <- long - k
lokp <- long + k
lakn <- lat - k
lakp <- lat + k

6) The loop test, for 10 iterations:

library(data.table) # needed for between()

tt <- proc.time()
while (i < 11) {
  lo_i <- long[i]
  la_i <- lat[i]

  w2 <- between(x1, lokn[i], lokp[i], incbounds = F) &
    between(x2, lakn[i], lakp[i], incbounds = F)
  px_vl <- x3[w2]

  if (length(px_vl) == 0) px_vl <- 0 # added for catching an empty px_vl;
  # you probably don't have this kind of problem in your data

  sm <- summary(px_vl)
  if (!"NA's" %in% names(sm)) sm <- c(sm, "NA's" = 0) # pad so r2 always matches the 10 columns of rez2

  r2 <- c(lo_i, la_i, sm, sd(px_vl, na.rm = T))

  rez2[i,] <- r2
  i = i + 1
}
tt2 <- proc.time() - tt
tt2
# ~55 sec for 10 iterations, so for all 19k rows:
19000/10 * 55 /60/60 # approx ~29 h

I found that using `between` from data.table gives a nice increase in speed for selecting the necessary values: it returns a logical index of the elements to select from the x1 vector. As mentioned before, using `summary` also gives some speed improvement. I encourage you to test this out and give feedback.
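If you want to verify the `between` speed-up on your own machine, a quick self-contained comparison along these lines should do (the vector size here is just illustrative):

x <- rnorm(1e7)
system.time(w1 <- x > -0.5 & x < 0.5)                                   # plain comparisons
system.time(w2 <- data.table::between(x, -0.5, 0.5, incbounds = FALSE)) # data.table between
identical(w1, w2)  # both approaches should give the same logical index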

Also, how much RAM do you have? If it's not a limitation, then there might be other solutions.

minem