I'm trying to build a basic character recognition model using the many classifiers that scikit-learn provides. The dataset is the Chars74K handwritten alphanumeric image set (taken from this source: EnglishHnd.tgz).
There are 55 samples of each character (62 alphanumeric characters in all), each 900x1200 pixels. I convert each image to grayscale and flatten the matrix into a 1x1080000 array, with each pixel treated as a feature.
# assuming imread/rgb2gray come from scikit-image
import numpy as np
from skimage.io import imread
from skimage.color import rgb2gray

for sample in sample_images:  # sample_images is the list of the .png files
    img = imread(sample)
    img_gray = rgb2gray(img)  # convert to grayscale
    if n == 0 and m == 0:  # n and m are global variables
        n, m = np.shape(img_gray)
    img_gray = np.reshape(img_gray, n*m)  # flatten into a 1-D feature vector
    img_gray = np.append(img_gray, sample_id)  # sample_id stores the label of the training sample
    if len(samples) == 0:  # samples is the final numpy ndarray
        samples = np.append(samples, img_gray)
        samples = np.reshape(samples, [1, n*m + 1])
    else:
        samples = np.append(samples, [img_gray], axis=0)
So the final data structure should have 55x62 = 3410 rows, each holding the 1080000 pixel features plus the label. Only this final structure is kept; the intermediate matrices are local to the loop and go out of scope.
The amount of data being stored to train the model is evidently very large, because the program doesn't progress beyond a certain point, and it crashed my system badly enough that the BIOS had to be repaired!
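For a rough sense of scale, here is a back-of-the-envelope estimate of the memory footprint of that array (assuming rgb2gray gives float64 values, NumPy's default floating-point type):

import numpy as np

n_rows = 55 * 62             # 3410 samples
n_cols = 900 * 1200 + 1      # 1080000 pixel features + 1 label column
bytes_per_value = np.dtype(np.float64).itemsize  # 8 bytes

total_gib = n_rows * n_cols * bytes_per_value / 1024**3
print(total_gib)  # ~27.4 GiB -- far more than the 8 GB of RAM available

If my understanding is right, np.append also returns a fresh copy of the whole array on every iteration, so the peak usage during the loop would be even higher than that.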
Up to this point, the program is only gathering the data to send to the classifier; the classification itself hasn't even been introduced into the code yet.
Any suggestions as to what can be done to handle the data more efficiently?
Note: I'm using numpy to store the final structure of flattened matrices. Also, the system has 8 GB of RAM.