
I am working with two different large data sets and trying to use mapply() to get iterative functions working.

The goal is to take each data point, column-wise, from Data_1 and compare it against both data points in the corresponding column of Data_2. So Data_1[1,1] will be compared against Data_2[1,1] and Data_2[2,1] only. To be clear, the data1 column in Data_1 will only be compared against the dataA elements in Data_2, so there is no cross-column comparison.

Data_1: NxM

  data1       data2       data3      data4
-0.710003   -0.714271   -0.709946   -0.713645
-0.710458   -0.715011   -0.710117   -0.714157
-0.71071    -0.714048   -0.710235   -0.713515
-0.710255   -0.713991   -0.709722   -0.713972

Data_2: PxQ

  dataA       dataB       dataC      dataD
-0.71097    -0.714059   -0.70928    -0.714059
-0.710343   -0.714576   -0.709338   -0.713644

I had earlier written a for()/while() loop based algorithm, but the run time was too long because the original data is large. Then I moved to apply() based logic, but I still had loops inside the function I was calling, so that didn't speed up the code. Following up on my earlier question, I am trying to figure out a better way to do this with mapply().

The part I am not able to visualize is the column-to-row comparison and how mapply() will iterate over it. How can I use mapply() or lapply() to do this efficiently?

Any suggestions will be helpful, thanks.

Chetan Arvind Patil

2 Answers


Consider a nested apply family call:

  • mapply() - outer: pairwise iteration over corresponding columns of Data_1 and Data_2
  • sapply() - inner: iteration over each value of the Data_1 column for the element-wise comparison

The code below checks whether each Data_1 value falls between the two values of the corresponding column of Data_2:

Data

txt = '  data1       data2       data3      data4
-0.710003   -0.714271   -0.709946   -0.713645
-0.710458   -0.715011   -0.710117   -0.714157
-0.71071    -0.714048   -0.710235   -0.713515
-0.710255   -0.713991   -0.709722   -0.713972'

Data_1 <- read.table(text=txt, header=TRUE)

txt = ' dataA       dataB       dataC      dataD
-0.71097    -0.714059   -0.70928    -0.714059
-0.710343   -0.714576   -0.709338   -0.713644'

Data_2 <- read.table(text=txt, header=TRUE)

Code

check_inbetween <- function(x,y){
  sapply(x, function(i) (i > y[1] & i < y[2]))
}

inbetween_matrix <- mapply(check_inbetween, Data_1, Data_2)

inbetween_matrix
#      data1 data2 data3 data4
# [1,] FALSE FALSE FALSE  TRUE
# [2,]  TRUE FALSE FALSE FALSE
# [3,]  TRUE FALSE FALSE FALSE
# [4,] FALSE FALSE FALSE  TRUE
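
Note that the check assumes the first Data_2 value in each column is the lower bound and the second the upper bound; for dataB and dataC the two values are in descending order, so those columns can never return TRUE. If the bounds may come in either order, a small variant using min()/max() is one option (just a sketch, with a hypothetical name check_inbetween2):

# variant that does not assume the two Data_2 values are in ascending order
check_inbetween2 <- function(x, y){
  sapply(x, function(i) (i > min(y) & i < max(y)))
}

mapply(check_inbetween2, Data_1, Data_2)
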
Parfait
  • Thanks @Parfait, I guess even `sweep()` can be [used](https://stackoverflow.com/questions/44959588/optimize-apply-while-in-r) – Chetan Arvind Patil Jul 07 '17 at 01:18
  • Possibly. I never used `sweep()`. And that implementation has an interesting nested sweep. My attempt seems to not recycle the function but only runs on the first columns of the datasets: `sweep(as.matrix(Data_1), 1, as.matrix(Data_2), FUN = function(x,y) (x > y[1] & x < y[2]), check.margin = FALSE)` – Parfait Jul 07 '17 at 01:58
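
For reference, a fully column-wise sweep() sketch in that spirit (just an illustration; it assumes the Data_1 and Data_2 frames defined above, with the first Data_2 row as the lower bound and the second as the upper bound) could be:

m1 <- as.matrix(Data_1)
m2 <- as.matrix(Data_2)

# sweep each Data_2 row across the corresponding columns of Data_1
sweep(m1, 2, m2[1, ], FUN = ">") & sweep(m1, 2, m2[2, ], FUN = "<")
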

Here is one solution based on data.table; if you are using a data.frame or matrix, it should be easy to adapt. To get what you want, you can nest one lapply() inside another: the outer one iterates over the columns and calls the inner one to iterate over the rows.

library(data.table)

# computes the differences for all elements of column 'j'
get_column_diff <- function(dt_1, dt_2, j){

        get_point_diff <- function(i){
                # returns a vector with the differences between point (i,j)
                # and every row of column j in dt_2
                unlist(dt_1[i, ..j]) - unlist(dt_2[, ..j])
        }


        i_rows <- 1:nrow(dt_1)
        lapply(X=i_rows, FUN=get_point_diff)

}

j_cols <- 1:ncol(Data_1)

# Data_1 and Data_2 come from read.table(), so convert them to data.tables
# so that the `..j` column selection works
lapply(X=j_cols, FUN=get_column_diff, dt_1=as.data.table(Data_1), dt_2=as.data.table(Data_2))

The call returns a list of lists: each element of the outer list is the result for one column, and each of its elements is the result for one row (a vector of differences against the rows of Data_2).
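
If a flatter structure is easier to work with, each column's list of row results can be collapsed into a matrix, for example (just a sketch; `res` is a hypothetical name for the result of the lapply() call above):

res <- lapply(X=j_cols, FUN=get_column_diff, dt_1=as.data.table(Data_1), dt_2=as.data.table(Data_2))

# one matrix per Data_1 column: rows of Data_2 down the rows, rows of Data_1 across the columns
res_matrices <- lapply(res, simplify2array)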

About the speed gain, I can't say how fast it will be without a benchmark comparison, but it will probably be faster than an explicit loop.
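
If you want numbers, a minimal benchmarking sketch (assuming the microbenchmark package is installed; DT_1 and DT_2 are names introduced here for data.table copies of the inputs) could look like:

library(microbenchmark)
library(data.table)

DT_1 <- as.data.table(Data_1)
DT_2 <- as.data.table(Data_2)

# time the nested lapply() approach (get_column_diff as defined above) over 100 runs
microbenchmark(
  nested_lapply = lapply(X=1:ncol(DT_1), FUN=get_column_diff, dt_1=DT_1, dt_2=DT_2),
  times = 100
)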

Rafael Toledo
  • Thanks @RafaelToledo. Your answer helps, but I need to dig more into how `unlist()` and `list()` work. Also, the line `j_cols <- 1:ncol(dt_1)` will lead to an error, right? Since `dt_1` is only assigned `Data_1` after that? – Chetan Arvind Patil Jul 06 '17 at 16:40
  • My bad, I'll update it; inside `ncol()` it must be `Data_1`. To organize your output, you have to think about how you want it first, because your output will be something like `(n x n) x n`. – Rafael Toledo Jul 06 '17 at 17:26