
After asking a question about merged-loop performance in R a while ago, I have gotten as far as the code below.

My .csv file is 1250 lines long and 2500+ columns wide. The column data types are not known beforehand; they can be positive numbers, strings, etc.

What I am trying to achieve is to keep only the rows that are at least 95% equal to some other row (equal values in at least 2375 of the 2500 columns).

Initially I tried to check for similarities, but then I realized that going by elimination rather than aggregation would require less computation. Since I am looking for 95% similarity, if a row has at least 2500 - 2375 = 125 + 1 = 126 non-equal columns compared to every other row in the dataset, then I can be sure that this row will never reach 95% similarity with any other row, and it can be removed from further processing.
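
To make the threshold concrete, here is the arithmetic as a small sketch (nCol and maxMismatch are just illustrative names, not from my code):

nCol <- 2500
perc <- 95
#maximum number of non-equal columns a row may have against some other
#row while still reaching 95% equality
maxMismatch <- nCol * (100 - perc) / 100   # 125
#a row with maxMismatch + 1 = 126 or more non-equal columns against
#every other row can never qualify and can be dropped early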

The code below successfully looks at a window of checkColCount columns at a time (seven columns when perc = 95, advancing five columns per iteration), deletes the rows that are non-similar to all other rows within that window (if such rows exist), and then moves on to the next window. The rows remaining in xdat at the end (if any) are those that have more than 95% of their columns equal to some other row.
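
To see which columns each pass touches, the window indices can be printed (a small illustrative snippet; note that for perc = 95 the seq stops at firstColToCheck = 91, i.e. the loop scans columns 1 through 97):

perc <- 95
first <- seq(1, 100-(100-perc), 100-perc)
last  <- 100-perc +1 + first
head(cbind(first, last), 3)
#      first last
# [1,]     1    7
# [2,]     6   12
# [3,]    11   17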

The problem is time. I tried to put this into parApply, but the results were different from the serial version (lots of empty rows in between). Ideally, I assume, I should be able to parallelize both apply calls below.

Note: I am on Windows, so fork-based parallelism (e.g. mclapply) is not available.

Working code:

xdat <- read.csv("ttt.csv", header =TRUE, stringsAsFactors = FALSE )
#Normally, I'm interested in at least 95% of the data to be **equal** 
perc <- 95

#replacing NA's with unique random negative numbers to make sure that NA's don't appear as identical between rows
xdat[is.na(xdat)] <- -1 * sample(dim(xdat)[1]*dim(xdat)[2], size=sum(is.na(xdat)), replace=FALSE)

#count of non-equal cells between two rows
compareRow <- function(a,b)
  {
    sum(a!=b,na.rm=TRUE)
  }
#adding the first row to the end for testing
xdat<-rbind(xdat,xdat[1,])
system.time( try(
for (firstColToCheck in seq(1,100-(100-perc),100-perc)){
  lastColToCheck <- 100-perc +1 + firstColToCheck
  checkColCount <- lastColToCheck- firstColToCheck +1
  #xx[i,j] = number of non-equal window columns between rows i and j
  xx <- apply(xdat[,firstColToCheck:lastColToCheck],1, function(a)
    {
      apply(xdat[,firstColToCheck:lastColToCheck],1, function(b) {compareRow(a,b)})
    })
  sumXX <- rowSums(xx)
  #a row reaches checkValue only if it differs from every other row
  #in every column of the current window
  checkValue <- checkColCount*(nrow(xdat)-1)
  xdat <- xdat[sumXX < checkValue,]
  cat(firstColToCheck , " ")
  gc(verbose = FALSE)
} , silent = TRUE )
)
xdat0 <- xdat
#   user  system elapsed
#   5.22    0.00    5.21

My try at parApply:

xdat <- read.csv("ttt.csv", header =TRUE, stringsAsFactors = FALSE )
#Normally, I'm interested in at least 95% of the data to be **equal** 
perc <- 95
#replacing NA's with unique random negative numbers to make sure that NA's don't appear as identical between rows
xdat[is.na(xdat)] <- -1 * sample(dim(xdat)[1]*dim(xdat)[2], size=sum(is.na(xdat)), replace=FALSE) 
xdat<-rbind(xdat,xdat[1,])

library(parallel)
no_cores <- detectCores() - 1
cl1 <- makeCluster(no_cores)
#note: firstColToCheck, lastColToCheck and checkColCount exist at this point
#only because the serial version above was run first in the same session
clusterExport(cl1, c('compareRow','firstColToCheck','xdat','lastColToCheck','checkColCount'))
clusterEvalQ(cl1, library(parallel))


system.time( try(
  for (firstColToCheck in seq(1,100-(100-perc),100-perc)){
    lastColToCheck <- 100-perc +1 + firstColToCheck
    checkColCount <- lastColToCheck- firstColToCheck +1
    xx <- parApply(cl1,xdat[,firstColToCheck:lastColToCheck],1, function(a)
    {
      apply(xdat[,firstColToCheck:lastColToCheck],1, function(b) {compareRow(a,b)  })
    })
    sumXX <- rowSums(xx)
    checkValue <- checkColCount*(nrow(xdat)-1)
    xdat <- xdat[sumXX < checkValue,]
    cat(firstColToCheck , " ")
  } , silent = TRUE )
)
stopCluster(cl1)
#   user  system elapsed
#   0.14    0.00    0.63
xdat1 <- xdat

Sample Data:

id  group   hs.grad race    gender  age m.status    political   n.kids  income
1   treat   yes white   male    19  never   republican  1   4716
2   control yes black   male    30  divorced    independent 2   4724
3   control yes black   female  32  married republican  3   1096
4   control no  white   male    35  divorced    republican  4   1084
5   control yes white   female  18  married republican  5   4720
6   control yes asian   male    22  married independent 6   2577
7   control yes white   female  26  never   democrat    7   3154
8   control yes asian   male    40  married republican  8   3267
9   control yes asian   female  23  married independent 9   3603
10  treat   yes white   male    19  divorced    republican  1   4716

Sample result assuming I don't require 95% but only 70% equality:

id  group   hs.grad race    gender  age m.status    political   n.kids  income
1   treat   yes white   male    19  never   republican  1   4716
10  treat   yes white   male    19  divorced    republican  1   4716

Only rows 1 and 10 should be left, since they have 8 equal columns (more than 70%). All other rows differ from every other row in at least 4 columns, so they have been removed from the set.
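
As a sanity check of this 70% example, assuming the sample above is read into a data frame named d (a name used here just for illustration), the expected result can be reproduced with the same pairwise-equality idea:

#eq[i,j] = number of equal columns between rows i and j
eq <- apply(d, 1, function(a) apply(d, 1, function(b) sum(a == b)))
diag(eq) <- 0   #ignore self-comparisons
keep <- apply(eq, 1, max) >= 0.7 * ncol(d)
d[keep, ]       #leaves rows 1 and 10 (8 of 10 columns equal)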

  • From a quick read of your Q, you may be able to do this without explicit loops. (For example, you could get the count of similar observations across rows using `proxy::dist(d, function(x, y) sum(x == y), diag=TRUE)`, which you would then divide by the number of columns to get proportions - untested for speed; see the sketch after this thread.) But it would definitely be helpful if you could add a small example dataset that illustrates your problem, say with dimension 10x10, and then show the expected outcome. – user20650 Jul 22 '17 at 00:54
  • I can't work with similarity functions; I explicitly need to look at equality when comparing columns. Similarity functions create different distances when I compare 100 vs 20 and 100 vs 99, but for me both are not equal and should both result in false. – Bahadir Ozkurt Jul 22 '17 at 05:36
  • The manually added function in the above comment tests equality. – user20650 Jul 22 '17 at 14:05
  • @user20650 Sorry, I failed to see the last part, my mind stopped at dist :) Substituting your function into my non-parallelized initial version made the whole thing much slower: user system elapsed 80.26 0.00 80.27 – Bahadir Ozkurt Jul 22 '17 at 18:14
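
For reference, here is user20650's proxy::dist suggestion from the first comment, expanded into a runnable sketch (untested for speed, as noted there; d is again the illustrative sample data frame):

library(proxy)
#pairwise count of equal columns between all row pairs
eqCount <- proxy::dist(d, method = function(x, y) sum(x == y), diag = TRUE)
#divide by the number of columns to get proportions of equal columns
eqProp <- as.matrix(eqCount) / ncol(d)
#rows reaching 70% equality with at least one other row
d[apply(eqProp, 1, max) >= 0.7, ]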
