Question updated!!

I have 15 columns of categorical variables and I want the correlation among them. The data set is 20,000+ long and the data set looks like this:

state | job | hair_color | car_color | marital_status
NY    | cs  | brown      | blue      | s
FL    | mt  | black      | blue      | d
NY    | md  | blond      | white     | m
NY    | cs  | brown      | red       | s

Notice that NY, cs, and s repeat in the first and last rows. That is the kind of pattern I want to find: here NY and cs are highly correlated. I need to rank the combinations of values across the columns. Note that this is NOT about counting NY or cs on their own; it is about finding how many times, say, NY and blond appear together in the same row. I need to do that for all values, row by row.

I tried to use `cor()` in R, but since these are categorical variables the function doesn't work. How can I work with this data set to find the correlation among them?

redeemefy
  • Can you clarify what you are trying to measure with `cor()`? For example, is `cor(c("red","blue"), c("red","yellow"))` higher than, the same as, or lower than `cor(c("red","blue"), c("red","brown"))`? – Weihuang Wong Aug 08 '16 at 17:56
  • No, it is not ordinal. For id 1 I have 15 colors, for id 2 another 15 colors, and I have 20,000 ids. Colors don't repeat within an id. I want to find how each color correlates with the other colors. With `cor()`, R returns a matrix with all the variables and how each one correlates with the others. The color variables are not ordinal; they're just categorical. Does what I'm trying to do make sense? – redeemefy Aug 08 '16 at 18:05
  • Yes, but for 16 variables instead of 2. – redeemefy Aug 09 '16 at 01:41

1 Answer

You may wish to refer to "Ways to calculate similarity". Suppose your data is

d <- structure(list(state = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("FL", 
"NY"), class = "factor"), job = structure(c(2L, 1L, 4L, 3L, 2L
), .Label = c("bs", "cs", "md", "mt"), class = "factor"), hair_color = structure(c(3L, 
3L, 1L, 2L, 3L), .Label = c("black", "blond", "brown"), class = "factor"), 
    car_color = structure(c(1L, 2L, 1L, 3L, 2L), .Label = c("blue", 
    "red", "white"), class = "factor"), marital_status = structure(c(3L, 
    1L, 1L, 2L, 3L), .Label = c("d", "m", "s"), class = "factor")), .Names = c("state", 
"job", "hair_color", "car_color", "marital_status"), class = "data.frame", row.names = c(NA, 
-5L))

Data:

> d
  state job hair_color car_color marital_status
1    NY  cs      brown      blue              s
2    FL  bs      brown       red              d
3    FL  mt      black      blue              d
4    NY  md      blond     white              m
5    NY  cs      brown       red              s

We can calculate the "dissimilarities" between observations:

library(cluster)
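# All columns are factors, so daisy() uses Gower's coefficient whatever metric
# is requested; asking for it explicitly avoids confusion.
daisy(d, metric = "gower")

Output:

> daisy(d, metric = "gower")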
Dissimilarities :
    1   2   3   4
2 0.8            
3 0.8 0.6        
4 0.8 1.0 1.0    
5 0.2 0.6 1.0 0.8

Metric :  mixed ;  Types = N, N, N, N, N 
Number of objects : 5

which tells us that observations 1 and 5 are the least dissimilar: they differ in only one of the five columns (car_color), giving 1/5 = 0.2. With many observations it is impractical to inspect the dissimilarity matrix visually, but we can filter for the pairs that fall below a certain threshold, e.g.

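out <- daisy(d, metric = "gower")
# daisy() stores only the lower triangle, column by column, so enumerate
# the (row, column) pairs with row > column in that same order
pairs <- expand.grid(2:5, 1:4)
pairs <- pairs[pairs[,1] > pairs[,2],]
similars <- pairs[which(out<.8),]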

Given a threshold of 0.8,

> similars
  Var1 Var2
4    5    1
6    3    2
8    5    2
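
If the goal is to rank pairs of values rather than compare whole rows, as the question describes, one rough sketch (separate from the dissimilarity approach above) is to cross-tabulate every pair of columns with `table()` and sort the counts. The names `col_pairs` and `cooc` are made up for illustration, and the toy data frame simply re-creates `d` from above.

```r
# Toy data re-creating the example `d` above (illustrative only)
d <- data.frame(
  state          = c("NY", "FL", "FL", "NY", "NY"),
  job            = c("cs", "bs", "mt", "md", "cs"),
  hair_color     = c("brown", "brown", "black", "blond", "brown"),
  car_color      = c("blue", "red", "blue", "white", "red"),
  marital_status = c("s", "d", "d", "m", "s")
)

# Every unordered pair of column names
col_pairs <- combn(names(d), 2, simplify = FALSE)

# Cross-tabulate each pair of columns and stack the counts in long form
cooc <- do.call(rbind, lapply(col_pairs, function(p) {
  tab <- as.data.frame(table(d[[p[1]]], d[[p[2]]]), stringsAsFactors = FALSE)
  data.frame(col1 = p[1], value1 = tab$Var1,
             col2 = p[2], value2 = tab$Var2,
             n = tab$Freq, stringsAsFactors = FALSE)
}))

# Drop value pairs that never occur together, then rank by frequency
cooc <- cooc[cooc$n > 0, ]
cooc <- cooc[order(-cooc$n), ]
head(cooc)
```

On this toy data the pair NY/cs appears together in two rows. Raw counts favour values that are common to begin with, so for real data it may be worth normalising them, for instance by the frequency of each value.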
Weihuang Wong