
I'm trying to build a basic character recognition model using the many classifiers that scikit-learn provides. The dataset is a standard set of handwritten alphanumeric samples (the Chars74K image dataset, taken from this source: EnglishHnd.tgz).

There are 55 samples of each character (62 alphanumeric characters in all), each 900x1200 pixels. I'm converting each image to grayscale and flattening the matrix into a 1x1080000 array (each element representing a feature).

import numpy as np
from skimage.io import imread
from skimage.color import rgb2gray

for sample in sample_images: # sample_images is the list of the .png files
    img = imread(sample)
    img_gray = rgb2gray(img)
    if n == 0 and m == 0: # n and m are global variables
        n, m = np.shape(img_gray)
    img_gray = np.reshape(img_gray, n*m)
    img_gray = np.append(img_gray, sample_id) # sample_id stores the label of the training sample
    if len(samples) == 0: # samples is the final numpy ndarray
        samples = np.append(samples, img_gray)
        samples = np.reshape(samples, [1, n*m + 1])
    else:
        samples = np.append(samples, [img_gray], axis=0)

So the final data structure should have 55x62 = 3410 rows, where each row holds 1,080,000 features plus the label. Only the final structure is being stored (the scope of the intermediate matrices is local).

The amount of data being stored to learn the model is pretty large (I guess), because the program isn't really progressing beyond a point, and it crashed my system so badly that the BIOS had to be repaired!
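For scale, here is a rough estimate of the final array's footprint, assuming the elements are float64 (which is what skimage's rgb2gray returns by default):

```python
rows = 55 * 62                    # 3410 samples
cols = 900 * 1200 + 1             # 1,080,000 pixels + 1 label column
bytes_needed = rows * cols * 8    # 8 bytes per float64 element
print(bytes_needed / 1e9)         # ~29.5 GB -- far beyond 8 GB of RAM
```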

Up to this point, the program is only gathering the data to send to the classifier ... the classification hasn't even been introduced into the code yet.

Any suggestions as to what can be done to handle the data more efficiently?

Note: I'm using numpy to store the final structure of flattened matrices. Also, the system has 8 GB of RAM.

tiny_rick

2 Answers


This seems like a case of stack overflow. You have 3,682,800,000 array elements, if I understand your question correctly. What is the element type? If it is one byte per element, that is about 3.7 gigabytes of data, easily enough to fill up your stack (usually about 1 megabyte). Even at one bit per element, you are still at roughly 460 MB. Try using heap memory (up to 8 GB on your machine).
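As a side note on the element type: NumPy stores elements at the dtype's fixed width inside one buffer, which is much smaller than what sys.getsizeof reports for a boxed Python int. A quick check:

```python
import sys
import numpy as np

a = np.zeros((2, 3))            # dtype is float64 by default
print(a.dtype.itemsize)         # 8 -- bytes per element inside the array
print(a.nbytes)                 # 48 -- size of the whole data buffer
print(sys.getsizeof(int()))     # ~24-28 -- a full Python int object, not an array element
```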

パスカル
  • Yeah! This is probably it. In addition, a good trick of the trade when processing and learning on image data is that a streaming model of computation is almost always better. Try to learn on your images one at a time, then read in the next image and forget the last one. Hope this helps. – jlarks32 Feb 28 '17 at 04:32
  • Each element is an int (sys.getsizeof(int()) returns 24) ... so it's actually way more than 3 GB! What surprises me is that when I tried running a similarly sized dataset in Octave, there weren't any issues ... is that strange? – tiny_rick Feb 28 '17 at 04:46
  • @jlarks32: do you mean creating 62 different classifiers and checking a "one v/s all" scenario in each of them OR dynamically updating the model weights (like an online learning type deal) as each sample is being observed? – tiny_rick Feb 28 '17 at 04:50
  • @AtulyaUrankar Octave might have some weird compression algorithm? I'm not sure - I do think that's strange. I mean the latter: dynamically updating your model's weights. So you iterate over the directory, only holding one picture in memory at a time. – jlarks32 Feb 28 '17 at 04:56
  • This means your structure is 9 gigs. You need to compress or restructure; otherwise not even the heap can hold all of this. Try @jlarks32's suggestion, who should post it as an answer – パスカル Feb 28 '17 at 05:09
  • @jlarks32: yup ... that never occurred to me. The first one (62 separate classifiers) did strike me, but there's no way of ensuring that the approach would even be correct. Thanks a lot, I'll give it a try ... never tried this before :) – tiny_rick Feb 28 '17 at 05:23
  • Just increasing the RAM helped – Anupam Jan 19 '22 at 07:11
0

I was encouraged to post this as a solution, although the comments above are probably more enlightening.

The issue with the user's program is twofold; really, it's just overwhelming the stack.

Much more common, especially with image processing in fields like computer graphics or computer vision, is to process the images one at a time. This works well with sklearn, where you can update your model as you read in each image.
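A minimal sketch of that incremental approach, using SGDClassifier as the online learner (any estimator with partial_fit would do); the random vectors here are stand-ins for real flattened images, and the shrunken feature count is just for the demo:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=0)
classes = np.arange(62)              # all labels must be declared on the first call

# Stand-in loop: in the real program each `x` would be
# rgb2gray(imread(path)).reshape(1, -1) for one image.
rng = np.random.default_rng(0)
for label in classes:
    x = rng.random((1, 1080))        # shrunken feature vector for the demo
    clf.partial_fit(x, [label], classes=classes)
    # the previous image can be freed before the next one is read
```

Because only one sample lives in memory at a time, peak usage stays at one image's worth of features instead of the full 3410-row matrix.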

You could use this bit of code, found in this Stack Overflow article:

import os
rootdir = '/path/to/my/pictures'

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith('.png'): # or whatever your file type is / some check
            # do your training here
            img = imread(os.path.join(subdir, file)) # build the full path to the file

            img_gray = rgb2gray(img)
            if n == 0 and m == 0: # n and m are global variables
                n, m = np.shape(img_gray)
            img_gray = np.reshape(img_gray, n*m)

            # sample_id stores the label of the training sample
            img_gray = np.append(img_gray, sample_id)

            # samples is the final numpy ndarray
            if len(samples) == 0:
                samples = np.append(samples, img_gray)
                samples = np.reshape(samples, [1, n*m + 1])
            else:
                samples = np.append(samples, [img_gray], axis=0)

This is more of a pseudocode sketch, but the general flow should have the right idea. Let me know if there's anything else I can do! Also check out OpenCV if you're interested in some cool deep learning algorithms. There's a bunch of cool stuff there, and images make for great sample data.

Hope this helps.

jlarks32