
I'm a new user of R, trying to move away from SAS. I'm asking this question here because I'm feeling a bit frustrated with all the packages and sources available for R, and I can't seem to get this working, mainly due to data size.

I have the following:

A table called SOURCE in a local MySQL database with 200 predictor features and one class variable. The table has 3 million records and is 3 GB in size. The number of instances per class is not equal.

I want to:

  1. Randomly sample the SOURCE table to create a smaller dataset with an equal number of instances per class.
  2. Divide the sample into a training and a testing set.
  3. Perform k-means clustering on the training set to determine k centroids per class.
  4. Perform k-NN classification of the test data against the centroids.
  • Welcome to SO! I suggest the **RMySQL** package to extract your data, the *sample* function (**base** package) for sampling, the *kmeans* function (**stats** package), and the *knn* function (**class** package). – agstudy Dec 02 '12 at 18:37
  • How do I handle the large data? The problem is that the table, pre-sampling, has to be kept in memory, and I only have 4 GB of RAM. – erichfw Dec 02 '12 at 18:58
    Try using the db engine to perform the random selection: http://stackoverflow.com/q/580639/269476. – James Dec 02 '12 at 19:51
    @entropy Have you looked at the `ff` package? The datatype `ffdf` is like a `data.frame`, but stores on disk instead of in memory. – Matthew Plourde Dec 02 '12 at 19:52
  • Do you mean 'k-NN classification of test data with centroids'? This isn't the usual way k-NN works; you usually just supply k, the number of neighbours, not centroids of the classes. Are you thinking of some variant of k-NN? – Stephen Henderson Dec 02 '12 at 23:44
  • Or you can try `sqldf` to get the random sample. Everything else is just some simple `data.frame` manipulation and using some available functions. – Roman Luštrik Dec 13 '12 at 10:04
  • @StephenHenderson As stated, I'm using a very large data set. Straightforward k-NN would not be pragmatic, as it does not scale well. I plan on pre-clustering using k-means to reduce the number of records sampled to a set of centroid records. This should reduce the size of the k-NN problem. This method has been used in the literature. – erichfw Dec 13 '12 at 16:02
  • Have you tried the filehash package? ffdf is also an excellent way to go, as Matthew said! – jackStinger Feb 08 '13 at 05:48
  • I haven't worked on this project in a while. Will try to get to it again and give feedback about what worked. – erichfw Mar 26 '14 at 15:24
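Following James's pointer above, the random draw can be pushed into MySQL itself. A minimal sketch, assuming the class column is literally named `class` (the question doesn't give the real column name); the query builder below is plain base R, and the RMySQL calls that would run it are shown only as comments, since they need a live connection:

```r
# Build a per-class random-sampling query for MySQL. ORDER BY RAND() is
# acceptable when run once per class; the column name `class` is an assumption.
sample_query <- function(class_label, n) {
  sprintf("SELECT * FROM SOURCE WHERE class = '%s' ORDER BY RAND() LIMIT %d",
          class_label, n)
}

sample_query("classA", 5000)
# With RMySQL, each class would then be fetched and stacked like:
#   balanced <- do.call(rbind, lapply(class_labels,
#                       function(cl) dbGetQuery(con, sample_query(cl, 5000))))
```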

3 Answers


The way I would proceed:

1) Extract a list of the ids in your table into R; you can do this with a simple SQL query using the RMySQL package.

2) Split the ids any way you like in R, then issue subsequent SQL queries, again using RMySQL (I found this two-step approach much quicker than sampling directly in MySQL).

3) Depending on how large your sample is, you may get away with the base R kmeans implementation. It can fail for bigger samples, though, in which case you should look into bigkmeans from the biganalytics package.
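The three steps above might look roughly like this. The `id` column name and the sample size are assumptions, and the RMySQL calls are left as comments because they need a live connection:

```r
# Two-step sampling sketch: pull the ids, sample in R, fetch only those rows.
# In practice: ids <- dbGetQuery(con, "SELECT id FROM SOURCE")$id
ids  <- seq_len(3e6)                 # stand-in for the real id list
keep <- sample(ids, 50000)           # step 2: sample the ids in R

query <- sprintf("SELECT * FROM SOURCE WHERE id IN (%s)",
                 paste(keep, collapse = ", "))
# sampled <- dbGetQuery(con, query)  # fetch only the sampled rows

# Step 3: base kmeans for modest samples, bigkmeans for larger ones
# fit <- kmeans(as.matrix(sampled[, predictor_cols]), centers = 10)
# fit <- biganalytics::bigkmeans(as.matrix(sampled[, predictor_cols]), centers = 10)
```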


I can help you with two of the steps: 1) stratified sampling and 2) splitting into training and testing sets (i.e. calibration and validation).

    n <- c(2.23, 3.5, 12, 2, 93, 57, 0.2, 33, 5, 2, 305, 5.3, 2, 3.9, 4)
    s <- c("aa", "bb", "aa", "aa", "bb", "cc", "aa", "bb", "bb", "aa", "aa",
           "aa", "aa", "bb", "cc")
    id <- c(1:15)
    df <- data.frame(id, n, s)    # df is a data frame

    source("http://news.mrdwab.com/stratified")
    sample <- stratified(df = df,
                         id = 1,     # position of the id column;
                                     # if there isn't one, you have to create it
                         group = 3,  # position of your class variable
                         size = 2,   # number of rows drawn per class
                         seed = NULL)

    # then add a new column to flag your selection
    sample["cal_val"] <- 1

    # you now have a random selection stratified on column 3,
    # but it still needs to be split into calibration and validation sets:
    sample2 <- stratified(df = sample,  # use your previous selection
                          id = 1,
                          group = 3,    # stratify on the same column as before
                          size = 1,     # half of the previous selection
                          seed = NULL)

    sample2["val"] <- 1

    # merge the two selections
    merge <- merge(sample, sample2, all.x = TRUE, by = "id")
    merge[is.na(merge)] <- 0    # replace the NAs produced by the merge

    # create a column where 1 means calibration and 2 means validation
    merge["calVal"] <- merge$cal_val.x + merge$cal_val.y

    # finally, "clean" the data frame, dropping the duplicated columns
    id <- merge$id
    n <- merge$n.x
    s <- merge$s.x
    calval <- merge$calVal
    final_sample <- data.frame(id, n, s, calval)
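In case the hosted `stratified` script ever becomes unavailable, the same per-class draw can be done in base R with `split` and `sample`. A small sketch using the same toy data:

```r
# Stratified sample in base R: 2 rows per level of s, no external script needed
n  <- c(2.23, 3.5, 12, 2, 93, 57, 0.2, 33, 5, 2, 305, 5.3, 2, 3.9, 4)
s  <- c("aa", "bb", "aa", "aa", "bb", "cc", "aa", "bb", "bb", "aa", "aa",
        "aa", "aa", "bb", "cc")
df <- data.frame(id = 1:15, n = n, s = s)

set.seed(1)
idx  <- unlist(lapply(split(seq_len(nrow(df)), df$s),
                      function(i) sample(i, 2)))  # 2 row indices per class
samp <- df[idx, ]
table(samp$s)   # two rows drawn from each of aa, bb, cc
```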

I think many of your problems could be solved by using the caret package. Regarding the random sampling with equal class membership, I'd push that back into SQL, just running two queries with the size you want for each class specified. Others have mentioned RMySQL; RODBC or RJDBC would work as well. For separating the data into train and test sets, use the following caret function:

# separate data into test and train sets, 70/30 split in this case
library(caret)

splitIndex <- createDataPartition(mydata$mytargetcolumn, p = 0.7, list = FALSE)
train <- mydata[splitIndex, ]
test <- mydata[-splitIndex, ]
testInd <- test[, !colnames(test) %in% "mytargetcolumn"]
testDep <- as.factor(test[, names(test) == "mytargetcolumn"])

You can also do your KNN using caret, like this:

modelKNN <- knn3(mytargetcolumn ~ ind1 + ind2, data = train, k = neighborCount, prob = TRUE)

and then prediction is easy:

# prediction using KNN to get class probabilities, change 'type' if you just want class prediction

predKNN <- predict(modelKNN, testInd, type = "prob")

You can also use caret for evaluation:

# Generate confusion matrix from class predictions and actual values
# (confusionMatrix needs class labels, so predict with type = "class" here)

predClassKNN <- predict(modelKNN, testInd, type = "class")
confKNN <- confusionMatrix(predClassKNN, testDep)

Though personally I use AUC (via the pROC package) for classification model evaluation, as it's a more fine-grained measure of classifier strength than accuracy.
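Under the hood, AUC is just the rank-sum (Mann-Whitney) statistic on the predicted probabilities. A small base-R illustration with toy data; with pROC the equivalent call would be `auc(roc(actual, prob))`:

```r
# Hand-rolled AUC via the rank-sum formula, on toy predictions
actual <- c(0, 0, 1, 1, 0, 1)                 # true class labels
prob   <- c(0.1, 0.4, 0.35, 0.8, 0.2, 0.9)    # predicted P(class == 1)

auc_hand <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc_hand(actual, prob)
# equivalently with pROC: auc(roc(actual, prob))
```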
