After a while of asking R merged loop performance question, i have come as far as the code below.
My .csv file is 1250 lines long and 2500+ columns wide. The data type in columns are not known beforehand, can be positive numbers or strings,etc.
What I am trying to achieve is to leave the lines that have 95% equal items to any other line (equal values in any of the 2375 columns out of 2500).
Initially I tried to check for similarities, but then I realized that going by elimination rather than aggregation would require less computation. Since I am looking for 95% similarity, if a line has 2500-2375 = 125 + 1 non-equal columns to every other line in the dataset, then I can be sure that this line will never have 95% similarity to other lines and can be removed from further processing.
The code below successfully looks at the leftmost checkColCount (%6) columns for all rows, then deletes rows that are non-similar to all other rows (if such exist) and then moves onto the next checkColCount columns. The resulting rows in xdat (if any) are those that have more than 95% similar columns.
The problem is time, I tried to put this into parApply, but the results were different than this (lots of empty rows in between). I assume, ideally I should be able to parallelize both apply's below.
Note: I am on windows OS
Working code:
xdat <- read.csv("ttt.csv", header =TRUE, stringsAsFactors = FALSE )
#Normally, I'm interested in at least 95% of the data to be **equal**
perc <- 95
#replacing NA's with unique random negative numbers to make sures that NA's don't appear as identical between rows
xdat[is.na(xdat)] <- -1 * sample(dim(xdat)[1]*dim(xdat)[2], size=sum(is.na(xdat)), replace=FALSE)
xdat<-rbind(xdat,xdat[1,])
compareRow <- function(a,b)
{
sum(a!=b,na.rm=TRUE)
}
#adding the first row to the end for testing
xdat<-rbind(xdat,xdat[1,])
system.time( try(
for (firstColToCheck in seq(1,100-(100-perc),100-perc)){
lastColToCheck <- 100-perc +1 + firstColToCheck
checkColCount <- lastColToCheck- firstColToCheck +1
xx <- apply(xdat[,firstColToCheck:lastColToCheck],1, function(a)
{
apply(xdat[,firstColToCheck:lastColToCheck],1, function(b) {compareRow(a,b) })
})
sumXX <-rowSums(xx)
checkValue <- checkColCount*(nrow(xdat)-1)
xdat<-xdat[rowSums(xx)<checkValue,]
cat(firstColToCheck , " ")
gc(verbose = FALSE)
} , silent = TRUE )
)
xdat0 <- xdat
# user system elapsed
# 5.22 0.00 5.21
My try at parApply:
xdat <- read.csv("ttt.csv", header =TRUE, stringsAsFactors = FALSE )
#Normally, I'm interested in at least 95% of the data to be **equal**
perc <- 95
#replacing NA's with unique random negative numbers to make sures that NA's don't appear as identical between rows
xdat[is.na(xdat)] <- -1 * sample(dim(xdat)[1]*dim(xdat)[2], size=sum(is.na(xdat)), replace=FALSE)
xdat<-rbind(xdat,xdat[1,])
library(parallel)
no_cores <- detectCores() - 1
cl1 <- makeCluster(no_cores)
clusterExport(cl1, c('compareRow','firstColToCheck','xdat','lastColToCheck','checkColCount'))
clusterEvalQ(cl1, library(parallel))
system.time( try(
for (firstColToCheck in seq(1,100-(100-perc),100-perc)){
lastColToCheck <- 100-perc +1 + firstColToCheck
checkColCount <- lastColToCheck- firstColToCheck +1
xx <- parApply(cl1,xdat[,firstColToCheck:lastColToCheck],1, function(a)
{
apply(xdat[,firstColToCheck:lastColToCheck],1, function(b) {compareRow(a,b) })
})
sumXX <-rowSums(xx)
checkValue <- checkColCount*(nrow(xdat)-1)
xdat<-xdat[rowSums(xx)<checkValue,]
cat(firstColToCheck , " ")
} , silent = TRUE )
)
stopCluster(cl1)
# user system elapsed
# 0.14 0.00 0.63
xdat1 <- xdat
Sample Data:
id group hs.grad race gender age m.status political n.kids income
1 treat yes white male 19 never republican 1 4716
2 control yes black male 30 divorced independent 2 4724
3 control yes black female 32 married republican 3 1096
4 control no white male 35 divorced republican 4 1084
5 control yes white female 18 married republican 5 4720
6 control yes asian male 22 married independent 6 2577
7 control yes white female 26 never democrat 7 3154
8 control yes asian male 40 married republican 8 3267
9 control yes asian female 23 married independent 9 3603
10 treat yes white male 19 divorced republican 1 4716
Sample result assuming I dont require 95% but 70% equality:
id group hs.grad race gender age m.status political n.kids income
1 treat yes white male 19 never republican 1 4716
10 treat yes white male 19 divorced republican 1 4716
only rows 1 and 10 should be left, since they have 8 columns equal (more than 70%) all other rows are different by at least 4 columns to any other row so they have been removed from the set.