How to find the row in a dataframe that most closely resembles a given vector

Question

Say I have a dataframe that looks like this:

Feature 1     Feature 2     Feature 3     Feature 4     Target
    1             1             1             1            a
    0             1             0             0            a 
    0             1             1             1            b

And a vector that looks like this:

0, 1, 1, 1

How would I find the indices of the closest matching rows to the vector? For example, if I wanted to find the 2 closest rows, I would input the vector and the dataframe (perhaps with the target column removed), and I would get indices 1 and 3 as a return from the function, since those rows most closely resemble the vector "0, 1, 1, 1".

I have tried using the "caret" package from R, with the command:

intrain <- createDataPartition(y = data$Target, p= 0.7, list = FALSE)
training <- data[intrain,]
testing <- data[-intrain,]

trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
knn_fit <- train(Target~., data = training, method = "knn", trControl = trctrl, preProcess = c("center", "scale"), tuneLength = 10)
test_pred <- predict(knn_fit, newdata = testing)
print(test_pred)

However, this doesn't return the index of the matching rows. It simply returns the predictions for the target that has features most closely matching the testing dataset.

I would like to find a model/command/function that can perform similarly to the KDtrees model from sklearn in python, but in R instead (KDtrees can return a list of the n closest indices). In addition, although not required, I would like said model to work with categorical values for features (such as TRUE/FALSE) so that I don't have to create dummy variables like I've done here with my 1's and 0's.

Define "most closely". In the example only the third row mathed the pattern for the numbered features. Furthermore having spaces in the column names suggests you haven't yet done any actual data input. First steps first for baby R programmers. — IRTFM, May 12 '19 at 00:58
I have a full dataset, but I did not want to copy it into here for the sake of simplicity (my dataset has 400 features). Same reason for why I put spaces in my column names. Most closely would preferably be via some sort of distance metric (With a default or predefined limit), but I'm not sure how that could be applied to categorical values such as "TRUE" or "FALSE". I'm hoping more experienced programmers could lend some insightful advice. — Adam Alayli, May 12 '19 at 01:02
This seems counter-productive. If you cannot construct a [MCVE] then it doesn't seem worthwhile spening coding effort. — IRTFM, May 12 '19 at 01:04
I have edited my example to make it more accurate. I hope the example I have given provides enough context for my goal, since my actual dataset is only a much larger version of the example I have given. I'm open to answering more questions. — Adam Alayli, May 12 '19 at 01:07
The ultimate goal would be to match the features inputted to several ID's of people with very similar feature sets. Then, I would look at other factors about these people and ultimately make decisions based on maximizing those factors in different ways. In the example, I have only provided one target, but in my dataset I would have multiple targets. I simply showed a target column to show how the caret package interacts with my data. I apologize for not being able to provide specifics. — Adam Alayli, May 12 '19 at 01:11

Evan Friedland · Accepted Answer · 2019-05-12T01:57:04.113

1

Agreed with 42's comment. With a simple distance metric, row 1 is equally different from the vector as 2.

# your data
featureframe <- data.frame(Feature1 = c(1,0,0), Feature2 = c(1,1,1), 
                           Feature3 = c(1,0,1), Feature4 = c(1,1,1), 
                           Target = c("a","a","b"))
vec <- c(0,1,1,1)

distances <- apply(featureframe[,1:4], 1, function(x) sum((x - vec)^2))
distances
# [1] 1 1 0

Edits as per comments:

To measure categorically what is similar you may instead quantify a similarity metric where the closer the sum is to the lenght of the vector, the closer the two vectors are:

similarity <- apply(featureframe[,1:4], 1, function(x) sum(x == vec))

If you'd like to weight certain features more, you can multiply the similarity vector inside the function by a weight vector of equal length.

similarity <- apply(featureframe[,1:4], 1, function(x) sum((x == vec) * c(1,2,1,1)))

edited May 12 '19 at 01:57

answered May 12 '19 at 01:03

Evan Friedland

3,062
1
11
25

I have edited my example to make it more accurate. I hope the example I have given provides enough context for my goal, since my actual dataset is only a much larger version of the example I have given. I'm open to answering more questions. – Adam Alayli May 12 '19 at 01:07
Can you give me some info on where this distance from vector calculation fails your need? – Evan Friedland May 12 '19 at 01:08
On second thought, this comes very close to what I need. I will try it out – Adam Alayli May 12 '19 at 01:13
Have you looked at other relevant questions? https://stackoverflow.com/questions/2453326/fastest-way-to-find-second-third-highest-lowest-value-in-vector-or-column – Evan Friedland May 12 '19 at 01:14
The only thing missing from this would be the application of categorical values. I have not found any solution to this, and any solution needs the use of dummy variables. This solution is limited to features with a yes or no value (or 1 and 0 in my case) – Adam Alayli May 12 '19 at 01:16
Well I mean, how do you quantify which is more different from "Red", a string with "Green" or "Blue"? It seems you have to provide more rules for categorical variables. – Evan Friedland May 12 '19 at 01:18
That's exactly my problem, and I have had to convert my dataset into the form given above as a rudimentary workaround. I was just hoping someone knew of a solution. – Adam Alayli May 12 '19 at 01:21
1

One problem I encountered when using your code: What if my target was (0,0,0,1), and there was a row in the dataframe that had (1,0,0,0)? Your code would count those two as the same, although they are completely different. – Adam Alayli May 12 '19 at 01:25
1

That is a very fair question :) In that sense, instead of using distance, we may want to use a count of equal values. Where 4 equals a perfect match. `distances <- apply(featureframe[,1:4], 1, function(x) sum(x == vec))` – Evan Friedland May 12 '19 at 01:27
Thank you, I could then simply find the index of the n highest values in distances to solve my problem. – Adam Alayli May 12 '19 at 01:36
Quick follow-up, once again. What if I wanted to assign a weight, and prefer certain features over others? For example, what if I wanted similarity Feature2 to be twice as important as similarity to Feature1? – Adam Alayli May 12 '19 at 01:45
You would just assign weights inside the function by multiplying a weight vector of equal length. `sum( (x == vec) * c(1,2,1,1) )` If this answer is satisfactory please mark your question as answered – Evan Friedland May 12 '19 at 01:50

Dij · Answer 2 · 2019-05-12T05:54:23.363

To find the smallest distances between vectors, you can make a distance matrix:

mat <- matrix(c(1,1,1,1
                0,1,0,0,
                0,1,1,1,
                0,1,1,1), 
              ncol = 4, byrow = T)
#the following will find the euclidean distance between each row vector
dist(mat, method = "euclidean")
         1        2        3
2 1.732051                  
3 1.000000 1.414214         
4 1.000000 1.414214 0.000000

Clearly, the minimum is here between rows 3 and 4 since they are identical

How to find the row in a dataframe that most closely resembles a given vector

2 Answers2