
I'm working on a project involving k-nearest neighbor (KNN) classification. I have mixed numerical and categorical fields. The categorical values are nominal (e.g. bank name, account type), the numerical fields are, for example, salary and age, and there are also some binary fields (e.g. male/female).

How do I go about incorporating categorical values into the KNN analysis?

As far as I'm aware, one cannot simply map each categorical field to number keys (e.g. bank 1 = 1; bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary numbers. Is this a feasible method?

Graham
  • Regarding converting categorical data into binary values: see http://arxiv.org/pdf/1210.7070v3.pdf; the beginning of Sec. 2 describes this conversion. – Shai Nov 29 '12 at 18:31
  • I'm using k-nearest neighbor clustering. I want to generate a cluster of k = 20 points around a test point using multiple parameters/dimensions (age, sex, bank, salary, account type). For account type, for example, you have current account, cheque account and savings account (categorical data). Salary, however, is continuous (numerical). How do I use categorical fields with continuous fields so as to carry out KNN clustering? – Graham Nov 29 '12 at 20:04
  • Do you have any training data? It sounds like you need to do some metric learning... – Shai Nov 29 '12 at 20:17
  • I've split the whole data set into 20% testing, 80% training. I've never heard of metric learning. Can't I just find some kind of numerical equivalent for the categorical data? – Graham Nov 29 '12 at 20:18
  • Are you sure you aren't talking about **knn classification**? – Has QUIT--Anony-Mousse Nov 30 '12 at 00:14
  • Technically, it's KNN regression. However, whether it's classification or regression, the problem is the same. – Graham Nov 30 '12 at 07:33
  • Yes, but the term "clustering" is misleading, as you don't seem to be interested in finding new clusters. I was wondering if you are doing k-means or something, but the name "kNN clustering" wouldn't fit for these either. – Has QUIT--Anony-Mousse Nov 30 '12 at 07:51
  • My apologies - I'm new to this subject! – Graham Nov 30 '12 at 08:18

3 Answers


You need to find a distance function that works for your data. Binary indicator variables solve this problem implicitly and have the benefit of letting you keep your (presumably matrix-based) implementation; but a much simpler way - and appropriate for most distance-based methods - is to just use a modified distance function.

There are infinitely many such combinations; you will need to experiment to find what works best for you. Essentially, you might want to use some classic metric on the numeric values (usually with normalization applied; it may also make sense to move this normalization into the distance function), plus a distance on the other attributes, scaled appropriately.

In most real application domains of distance-based algorithms, this is the hardest part: optimizing your domain-specific distance function. You can see it as part of preprocessing: defining similarity.
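
As one concrete illustration (a sketch only - the field names, scale factors, and weight below are invented for the example, not taken from the question), such a combined distance might look like:

```python
import numpy as np

def mixed_distance(a, b, num_idx, cat_idx, num_scale, w_cat=1.0):
    """Euclidean distance on the rescaled numeric fields, plus a
    0/1 mismatch penalty (weighted by w_cat) per categorical field."""
    a, b = np.asarray(a, dtype=object), np.asarray(b, dtype=object)
    num_a = np.array([a[i] for i in num_idx], dtype=float)
    num_b = np.array([b[i] for i in num_idx], dtype=float)
    d_num = np.sqrt(np.sum(((num_a - num_b) / num_scale) ** 2))
    d_cat = w_cat * sum(a[i] != b[i] for i in cat_idx)
    return d_num + d_cat

# Hypothetical records: (age, salary, bank, account_type)
x = (35, 50000.0, "bank_a", "savings")
y = (40, 52000.0, "bank_b", "savings")
print(mixed_distance(x, y, num_idx=[0, 1], cat_idx=[2, 3],
                     num_scale=np.array([10.0, 5000.0])))
```

Dividing each numeric difference by a per-feature scale (e.g. its standard deviation) plays the role of the normalization mentioned above, and w_cat controls how strongly a categorical mismatch counts against numeric closeness.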

There is much more than just Euclidean distance. There are various set-theoretic measures that may be much more appropriate in your case - for example, the Tanimoto coefficient, Jaccard similarity, Dice's coefficient, and so on. Cosine similarity might be an option, too.
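
On binary indicator vectors these set-based measures reduce to simple bit counting; a minimal sketch (the vectors are made up):

```python
import numpy as np

a = np.array([1, 0, 0, 1, 0], dtype=bool)  # made-up indicator vectors
b = np.array([1, 0, 1, 1, 1], dtype=bool)

inter = np.sum(a & b)   # attributes present in both
union = np.sum(a | b)   # attributes present in either
jaccard = inter / union                  # |A∩B| / |A∪B|; on bit vectors
                                         # this equals the Tanimoto coefficient
dice = 2 * inter / (a.sum() + b.sum())   # 2|A∩B| / (|A| + |B|)
print(jaccard, dice)                     # 0.5 0.666...
```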

There are whole conferences dedicated to the topic of similarity search - nobody claims this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012

Has QUIT--Anony-Mousse
  • Thanks for the answer. I'm still struggling with how to actually implement this. I'm using a simple Matlab function, "knnsearch", which calculates Euclidean distances to each point. Now, sure, I could say male/female = 0/1, or bank 1 = 100, bank 2 = 010, bank 3 = 001. However, how do I use this? I just need a little guidance about where to begin implementing a distance function, and whether it's possible to still use Matlab's function... – Graham Nov 30 '12 at 08:24
  • I don't use Matlab, so I don't know how to do it there. Euclidean distance makes sense in physical 2D/3D space, but isn't as good for higher-dimensional, non-physical data. I personally don't like the "bit encoding" of such values. Note that in your example male/female has a lower weight (distance 0 or 1) than the bank attribute (distance 0 or sqrt(2) in Euclidean)! – Has QUIT--Anony-Mousse Nov 30 '12 at 09:24
  • @HasQUIT--Anony-Mousse By mapping `male/female` to `0/1`, or `bank1/bank2` to `100/010`, for example, you are saying that it throws off the Euclidean distance because they are weighted differently? Even having `bank1/bank2/bank3` map to `0/1/2` is bad because bank3 has a different weight? Then categorical data should not be mapped like that. – Scott Apr 01 '22 at 01:56
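
To make the weighting issue from this exchange concrete, here is a quick check (a sketch; the encodings are the ones discussed in the comments above):

```python
import numpy as np

# male/female as a single 0/1 feature: maximum distance 1
print(np.linalg.norm(np.array([0.0]) - np.array([1.0])))      # 1.0

# two different banks, one-hot encoded: distance sqrt(2) ~ 1.414
b1, b2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
print(np.linalg.norm(b1 - b2))                                # 1.4142...

# rescaling the one-hot block by 1/sqrt(2) restores equal weighting
print(np.linalg.norm((b1 - b2) / np.sqrt(2)))                 # 1.0
```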

The most straightforward way to convert categorical data into numeric features is to use indicator vectors. See the reference I posted in my previous comment.
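
A minimal sketch of that conversion (the records and field names are invented; in practice a library encoder such as scikit-learn's OneHotEncoder would do the same thing):

```python
import numpy as np

# Hypothetical rows: (age, salary, bank)
records = [
    (35, 50000.0, "bank_a"),
    (40, 52000.0, "bank_b"),
    (29, 48000.0, "bank_a"),
]

banks = sorted({r[2] for r in records})  # fix a category order

def to_vector(age, salary, bank):
    """Numeric fields pass through; the bank becomes an indicator vector."""
    indicator = [1.0 if bank == b else 0.0 for b in banks]
    return np.array([age, salary] + indicator)

X = np.vstack([to_vector(*r) for r in records])
print(X)  # each row: [age, salary, is_bank_a, is_bank_b]
```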

Shai
  • Thanks Shai - I'm getting a bit weighed down by the technical details in your paper. Please see the comments below. – Graham Nov 30 '12 at 08:25
  • For each categorical variable that takes n possible values, just create n dimensions. Each dimension corresponds to one particular value and is either 0 (not present) or 1 (present). Thus your n-way categorical variable becomes n binary features. Now you can use Euclidean distance, or any other metric you like – Ben Allison Dec 05 '12 at 13:22
  • @BenAllison If your vectors are binary ones (i.e., consisting of only zeros and ones) a good distance measure can be the **Hamming** distance: http://en.wikipedia.org/wiki/Hamming_distance. It can be very easily and efficiently computed. – Shai Dec 05 '12 at 14:03
  • Right, in this case I think you're going to have these binary features mixed with continuous ones though (see OP). But yes, there are plenty of metrics you could use, or even do some metric learning as @Anony-Mousse suggests (although I'd first attempt standard metrics before going down this route) – Ben Allison Dec 05 '12 at 14:15
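
The Hamming distance mentioned in Shai's comment above is just the count of differing positions; a tiny sketch on made-up indicator vectors:

```python
import numpy as np

a = np.array([1, 0, 0, 1], dtype=bool)  # made-up binary vectors
b = np.array([0, 0, 1, 1], dtype=bool)
print(np.sum(a != b))  # Hamming distance: number of positions that differ -> 2
```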

Can we use Locality-Sensitive Hashing (LSH) plus edit distance, and assume that every bin represents a different category? I understand that categorical data does not show any order, while the bins in LSH are arranged according to a hash function. Finding a hash function that gives a meaningful number of bins sounds to me like learning a metric space.

omarflorez