
I have what I thought was a well-prepared dataset. I wanted to use the Apriori Algorithm in R to look for associations and come up with some rules. I have about 16,000 rows (unique customers) and 179 columns that represent various items/categories. The data looks like this:

     Cat1  Cat2  Cat3  Cat4  Cat5 ... Cat179
     1,     0,    0,    0,    1,  ...  0
     0,     0,    0,    0,    0,  ...  1
     0,     1,    1,    0,    0,  ...  0
     ...

I thought a comma-separated file with binary values (1/0) for each customer and category would do the trick, but after I read in the data using:

data5 = read.csv("Z:/CUST_DM/data_test.txt",header = TRUE,sep=",")

and then run this command:

rules = apriori(data5, parameter = list(supp = .001,conf = 0.8))

I get the following error:

Error in asMethod(object):
column(s) 1, 2, 3, ...178 not logical or a factor. Discretize the columns first.  

I understand discretization, but not in this context, I guess. Every value is a 1 or a 0. I even changed the data from INT to CHAR and received the same error. I also had the (unique) customer ID as column 1, but I understand that isn't necessary when the data is in this flat-file form. I'm sure I'm missing something obvious; I'm new to R.

What am I missing? Thanks for your input.

Benny
  • Please read [How to make a great reproducible example in R?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – M-- Jun 21 '17 at 16:03
  • It's really not possible to help you without a reproducible example. It sounds like there's a problem with your data but without being able to reproduce the problem, we can't say what's wrong for sure. – MrFlick Jun 21 '17 at 16:15
  • Fair enough. Can you tell me this: is a comma-separated file of 1's and 0's an acceptable format for apriori? And do I need a unique ID column? I understand I do not once the data is in flat-file format. The answers to those two questions will eliminate a few potential problems, I think. Thanks. – CalData Jun 21 '17 at 18:30
  • I solved the problem this way: After reading in the data to R I used lapply() to change the data to factors (I think that's what it does). Then I took that data set and created a data frame from it. Then I was able to apply apriori() successfully. – CalData Jun 21 '17 at 23:32

2 Answers


I solved the problem this way: After reading in the data to R I used lapply() to change the data to factors (I think that's what it does). Then I took that data set and created a data frame from it. Then I was able to apply apriori() successfully.
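A minimal sketch of what those steps might look like, using a small made-up data frame (`toy` is hypothetical and stands in for `data5`). It assumes the arules package is installed. Note that with factor columns every level becomes its own item, so the 0s turn into items as well:

```r
library(arules)

# Hypothetical stand-in for data5; the real file has 179 columns
toy <- data.frame(Cat1 = c(1, 0, 0, 1),
                  Cat2 = c(0, 0, 1, 1),
                  Cat3 = c(0, 1, 1, 1))

# Convert every column to a factor, then rebuild a data frame from the list
fac <- as.data.frame(lapply(toy, factor))

# Coercing a data frame of factors creates one item per variable/level pair,
# so labels take the form "Cat1=0", "Cat1=1", ...
trans <- as(fac, "transactions")
print(itemLabels(trans))
```

Because "Cat1=0" is an item too, rules mined from `trans` can be about the absence of a category as well as its presence.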


Your data is actually already in (dense) matrix format, but read.csv always reads data in as a data.frame. Just coerce the data to a matrix first:

dat <- as.matrix(data5)
rules <- apriori(dat, parameter = list(supp = .001,conf = 0.8))

1s in the data will be interpreted as the presence of the item and 0s as the absence. More information about how to create transactions can be found in the manual page `?transactions`.
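As a small self-contained sketch of the same idea, the made-up data frame `toy` below stands in for `data5`, and the support threshold is raised only because the toy data has four rows. It assumes the arules package is installed and that the data contains nothing but 0s and 1s:

```r
library(arules)

# Hypothetical stand-in for data5; the real file has 179 columns
toy <- data.frame(Cat1 = c(1, 0, 0, 1),
                  Cat2 = c(0, 0, 1, 1),
                  Cat3 = c(0, 1, 1, 1))

# A logical matrix makes the meaning explicit: TRUE = item present
m <- as.matrix(toy) == 1

# Each row becomes one transaction, each column one item
trans <- as(m, "transactions")
summary(trans)

# Thresholds here are illustrative for the toy data only
rules <- apriori(trans, parameter = list(supp = 0.25, conf = 0.8))
inspect(rules)
```

Here the item labels are just the column names, so rules only describe categories that are present.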

Michael Hahsler
  • I tried your suggestion, Michael, and it didn't work for me. The as.matrix command didn't return an error, but when I tried to use the apriori() function I received this error: `Error in if (any(from != 0 & from != 1)) warning("matrix contains values other than 0 and 1! Setting all entries != 0 to 1.") : missing value where TRUE/FALSE needed`. (I apologize, I haven't done this before and I know the formatting is wrong, but I want to let you know my solution.) – CalData Jun 27 '17 at 21:06
  • I have a solution: first I changed the 0's to NULLs in my CSV data set. Next, I used the lapply(dataname,factor) function, then I converted the result into a data.frame. Then I could apply the apriori() function successfully. Thanks for your suggestion. (I apologize, this is my first try at this and I don't know how the formatting works.) – CalData Jun 27 '17 at 21:14
  • The error message says that your matrix contains values that are not 0 or 1. So all you need to do is find out where the "bad" values are. For example, you can use `table(as.matrix(data5), useNA = "always")`, and if you see anything other than 0 and 1, you know you have a problem. I guess you want to change anything that is not a 1 into a 0, and then you should be fine. Creating factors with `lapply` has the disadvantage in this case that you will get items for all the 0s and also for all the other weird values that you have. – Michael Hahsler Jun 29 '17 at 02:55
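
The check suggested in the last comment could be sketched roughly as follows, in base R only. The data frame `toy` is made up and stands in for `data5`, and treating every non-1 value as "absent" is the assumption stated in that comment, not something confirmed about the real data:

```r
# Hypothetical stand-in for data5, with one NA and one stray value
toy <- data.frame(Cat1 = c(1, 0, NA, 1),
                  Cat2 = c(0, 2, 1, 0))

m <- as.matrix(toy)

# Tabulate every distinct value, including missing ones
print(table(m, useNA = "always"))

# Flag cells that are missing or are neither 0 nor 1
is_bad <- is.na(m) | (m != 0 & m != 1)

# Row/column positions of the offending cells
bad <- which(is_bad, arr.ind = TRUE)
print(bad)

# Assumption: anything that is not a 1 counts as "item absent"
m[is_bad] <- 0
stopifnot(all(m %in% c(0, 1)))
```

After this cleanup, `m` contains only 0s and 1s and can be passed on as a matrix.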