I'm a new R user, trying to move away from SAS. I'm asking this question here because I'm feeling a bit overwhelmed by all the packages and sources available for R, and I can't seem to get this working, mainly due to the data size.
I have the following:
A table called SOURCE in a local MySQL database with 200 predictor features and one class variable. The table has 3 million records and is about 3 GB in size. The classes are imbalanced.
I want to:
- Randomly sample the SOURCE table to create a smaller dataset with an equal number of instances per class.
- Divide the sample into training and test sets.
- Perform k-means clustering on the training set to determine k centroids per class.
- Perform k-NN classification of the test data against those centroids.
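To make the question concrete, here is a minimal sketch of the four steps in R. It is not a working answer for the real table: a small synthetic data frame (two classes, four features) stands in for SOURCE, the balanced-sampling SQL is only shown as a comment, and the class counts, `k`, and split ratio are arbitrary placeholders.

```r
library(class)  # provides knn(); part of R's recommended packages

set.seed(1)

# -- Step 1: balanced sample ------------------------------------------
# Against the real MySQL table, one option is a per-class query via
# DBI/RMySQL so only the sample leaves the database, e.g.:
#   SELECT * FROM SOURCE WHERE class = '<cl>' ORDER BY RAND() LIMIT <n>
# Synthetic stand-in: 500 rows per class, 4 features, shifted means.
n_per_class <- 500
dat <- do.call(rbind, lapply(c("a", "b"), function(cl) {
  data.frame(matrix(rnorm(n_per_class * 4,
                          mean = ifelse(cl == "a", 0, 2)),
                    ncol = 4),
             class = cl)
}))

# -- Step 2: train/test split -----------------------------------------
idx   <- sample(nrow(dat), 0.7 * nrow(dat))
train <- dat[idx, ]
test  <- dat[-idx, ]

# -- Step 3: k-means per class -> k centroids for each class ----------
k     <- 5
feats <- setdiff(names(dat), "class")
cents <- lapply(split(train[feats], train$class),
                function(x) kmeans(x, centers = k)$centers)
centroids <- do.call(rbind, cents)          # (k * n_classes) rows
labels    <- rep(names(cents), each = k)    # class label per centroid

# -- Step 4: 1-NN of test rows against the centroids ------------------
pred <- knn(centroids, test[feats], labels, k = 1)
mean(pred == test$class)  # accuracy on the held-out sample
```

The same structure should carry over to the real data once the synthetic block is replaced by the per-class MySQL queries; is this a reasonable way to wire these steps together, or is there a package that handles the pipeline better at this scale?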