I have a data set of about a thousand rows, but it will grow. I want to find groups of rows where a certain number of attributes are almost identical.
I can do a stupid brute-force search, but it's really slow and I'm sure R can do this much better.
Test data:
df = read.table(text="
Date Time F1 F2 F3 F4 F5 Conc
2010-06-23 04:00:00 0.1 17.6 14.2 19.5 18.6 16.1
2010-12-16 00:20:00 0.0 28.7 13.7 15.6 16.2 14.4
2010-12-16 10:30:00 0.0 17.0 14.0 19.0 18.0 16.0
2010-12-16 22:15:00 0.0 25.5 12.6 14.9 16.6 16.8
", header=T)
#Initialize column to hold clustering info
df$cluster = NA
#Maximum tolerance for a match between rows
toler=1.0
#Brute force search, very ugly.
for(i in 1:(nrow(df)-1)){
if(is.na(df$cluster[i])){
df$cluster[i] <- i
for(j in (i+1):nrow(df)){
if(max(abs(df[i,3:7] - df[j,3:7]))<toler){
df$cluster[j]<-i
}
}
}
}
if(is.na(df$cluster[j])){df$cluster[j] <- j}
df
Expected output:
Date Time F1 F2 F3 F4 F5 Conc cluster
1 2010-06-23 04:00:00 0.1 17.6 14.2 19.5 18.6 16.1 1
2 2010-12-16 00:20:00 0.0 28.7 13.7 15.6 16.2 14.4 2
3 2010-12-16 10:30:00 0.0 17.0 14.0 19.0 18.0 16.0 1
4 2010-12-16 22:15:00 0.0 25.5 12.6 14.9 16.6 16.8 4