2

I have a data set of about a thousand rows, but it will grow. I want to find groups of rows where a certain number of attributes are almost identical.

I can do a stupid brute-force search, but it's really slow and I'm sure R can do this much better.

Test data:

df = read.table(text="
      Date     Time  F1   F2   F3   F4   F5 Conc
2010-06-23 04:00:00 0.1 17.6 14.2 19.5 18.6 16.1
2010-12-16 00:20:00 0.0 28.7 13.7 15.6 16.2 14.4
2010-12-16 10:30:00 0.0 17.0 14.0 19.0 18.0 16.0
2010-12-16 22:15:00 0.0 25.5 12.6 14.9 16.6 16.8
", header=T)


#Initialize column to hold clustering info
df$cluster = NA

#Maximum tolerance for a match between rows 
toler=1.0

#Brute force search, very ugly.
for(i in 1:(nrow(df)-1)){
    if(is.na(df$cluster[i])){
        df$cluster[i] <- i
        for(j in (i+1):nrow(df)){
            if(max(abs(df[i,3:7] - df[j,3:7]))<toler){
                df$cluster[j]<-i
            }
        }
    }
}
if(is.na(df$cluster[j])){df$cluster[j] <- j}

df

Expected output:

        Date     Time  F1   F2   F3   F4   F5 Conc cluster
1 2010-06-23 04:00:00 0.1 17.6 14.2 19.5 18.6 16.1       1
2 2010-12-16 00:20:00 0.0 28.7 13.7 15.6 16.2 14.4       2
3 2010-12-16 10:30:00 0.0 17.0 14.0 19.0 18.0 16.0       1
4 2010-12-16 22:15:00 0.0 25.5 12.6 14.9 16.6 16.8       4
David Richfield
  • 123
  • 1
  • 8

1 Answers1

2

Here is an option using data.table:

df[, (cols) := c(.SD-1, .SD+1), .SDcols=fcols]
onstr <- c(paste0(fcols,">",cols[rng]),paste0(fcols,"<",cols[5+rng]))
df[, cluster := df[df, on=onstr, by=.EACHI, min(x.rn)]$V1]

output:

         Date     Time  F1   F2   F3   F4   F5 Conc rn lower1 lower2 lower3 lower4 lower5 upper1 upper2 upper3 upper4 upper5 cluster
1: 2010-06-23 04:00:00 0.1 17.6 14.2 19.5 18.6 16.1  1   -0.9   16.6   13.2   18.5   17.6    1.1   18.6   15.2   20.5   19.6       1
2: 2010-12-16 00:20:00 0.0 28.7 13.7 15.6 16.2 14.4  2   -1.0   27.7   12.7   14.6   15.2    1.0   29.7   14.7   16.6   17.2       2
3: 2010-12-16 10:30:00 0.0 17.0 14.0 19.0 18.0 16.0  3   -1.0   16.0   13.0   18.0   17.0    1.0   18.0   15.0   20.0   19.0       1
4: 2010-12-16 22:15:00 0.0 25.5 12.6 14.9 16.6 16.8  4   -1.0   24.5   11.6   13.9   15.6    1.0   26.5   13.6   15.9   17.6       4

data:

df = read.table(text="
      Date     Time  F1   F2   F3   F4   F5 Conc
2010-06-23 04:00:00 0.1 17.6 14.2 19.5 18.6 16.1
2010-12-16 00:20:00 0.0 28.7 13.7 15.6 16.2 14.4
2010-12-16 10:30:00 0.0 17.0 14.0 19.0 18.0 16.0
2010-12-16 22:15:00 0.0 25.5 12.6 14.9 16.6 16.8
", header=T)

library(data.table)
setDT(df)[, rn := .I]
toler <- 1.0
rng <- 1L:5L
fcols <- paste0("F", rng)
cols <- do.call(paste0, CJ(c("lower", "upper"), rng))

Explanation:

  • L behind an integer is telling R this is of type integer (see Why would R use the "L" suffix to denote an integer?)

  • df[df, on=onstr, by=.EACHI, min(x.rn)] is performing a non-equi self-join using the inequalities specified in onstr.

  • $V1 access the V1 column of the above join (naming is defaulted to V* when not provided)

  • df[, cluster := my_result] updates the original data.table by reference so that its faster when dataset is large. Improvement is due to the fact that there is no need to do a deep copy of the original data.frame as in base R.

  • since we are performing a join, there is a left table and right table. In data.table lingo, for x[i, on=join_keys], x is right table and i is left table (inspiration comes from base R syntax). Hence, x. is used to refer to access columns in the left table like a SQL syntax and similarly for i.. More details can be found in the data.table vignettes. (see https://cran.r-project.org/web/packages/data.table/)

chinsoon12
  • 25,005
  • 4
  • 25
  • 35