5

I am trying to apply the SelectKBest algorithm to my data to select the best features. As a first step I am preprocessing the data with DictVectorizer. The data consists of 1,061,427 rows with 15 features, each of which takes many distinct values, and I believe the memory error is caused by this high cardinality.

I get the following error:

File "FeatureExtraction.py", line 30, in <module>
    quote_data = DV.fit_transform(quote_data).toarray()
File "/usr/lib64/python2.6/site-packages/scipy/sparse/compressed.py", line 563, in toarray
    return self.tocoo(copy=False).toarray()
File "/usr/lib64/python2.6/site-packages/scipy/sparse/coo.py", line 233, in toarray
    B = np.zeros(self.shape, dtype=self.dtype)
MemoryError

Is there an alternative way to do this? Why do I get a memory error when I am running this on a machine with 256 GB of RAM?

Any help is appreciated!

Gayatri
  • Seems like your error comes from the `toarray` method and not from `DictVectorizer`. Do you have to turn it into an array? – Korem Jun 17 '14 at 11:40
  • Yes, I have to convert it to an array. Is there any other way of doing it? – Gayatri Jun 17 '14 at 18:52
  • @TalKremerman I also tried removing the toarray() and passing sparse=False instead, and I still get the same error. Here is the code: `DV = DictVectorizer(sparse=False); data = DV.fit_transform(data)`, and earlier I had written `DV = DictVectorizer(sparse=True); data = DV.fit_transform(data).toarray()`. Either way it produces a two-dimensional array of 0's and 1's, which is what I need to feed into SelectKBest. – Gayatri Jun 17 '14 at 19:23
  • How about trying with a [pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)? – Korem Jun 17 '14 at 21:47
  • @TalKremerman That doesn't change anything. – Fred Foo Jun 18 '14 at 10:29
  • @user2217866 Which feature selection metric are you using? – Fred Foo Jun 18 '14 at 10:30
  • @larsmans What do you mean by feature selection metric? Here is my code: `data = list(csv.DictReader(open("Data.csv", mode='rU')))`; `target = list(csv.reader(open("target.csv", mode='rU')))`; `target = list(itertools.chain(*target))` (removes the square brackets around the labels); `print "size of the file is", len(target)`; `DV = DictVectorizer(sparse=False)`; `print type(DV)`; `data = DV.fit_transform(data)` <-- this last line is where I get the memory error. The code works fine on a data file of 15,000 records. – Gayatri Jun 18 '14 at 14:43
  • @larsmans I was simply offering ideas - had I known the solution I would've posted an answer – Korem Jun 18 '14 at 14:53
  • @user2217866: `f_classif`, `chi2`? The latter can work with sparse matrices. – Fred Foo Jun 18 '14 at 17:52
  • f_classif and chi2 are feature selection methods like SelectKBest, but the problem I am getting is with DictVectorizer, which is feature extraction. DictVectorizer converts the feature values into 0's and 1's, and because of the large number of values (high cardinality), it is unable to store the result, I guess. – Gayatri Jun 18 '14 at 19:27
  • Are you using 64-bit Python? Most memory errors are because of the 32-bit limit. Having 256 GB of RAM should not be an issue. – Davidmh Jun 19 '14 at 21:48
  • Yes, I am using a 64-bit version of Python. – Gayatri Jun 19 '14 at 22:03
  • What does your memory usage look like before the transform? If it is creating an array of zeros and then copying into it, it is (temporarily, at least) doubling your memory usage, judging by this line: `B = np.zeros(self.shape, dtype=self.dtype)`. Have you tried just calling fit(), then chunking the data up and using transform() on the smaller chunks? At the very least the temporary array allocation would be much smaller. – Kyle Kastner Jun 20 '14 at 09:08
  • How many features are produced when you use the 15000-record data set? Maybe you could also try 5000- and 10000-record data sets and see if there is a trend in the number of features created? – DavidS Jun 21 '14 at 00:57

7 Answers

6

I figured out the problem.

When I removed a column with very high cardinality, DictVectorizer worked fine. That column had millions of distinct values, and that is why DictVectorizer was giving a memory error.

Gayatri
  • Removing a column with very high cardinality is not a solution and this wasn't the root problem. The problem was `toarray()`. DictVectorizer from sklearn is designed for exactly this purpose - vectorizing categorical features with high cardinality. Read my answer below. – Serendipity Nov 29 '15 at 13:18
4

The problem was toarray(). DictVectorizer from sklearn (which is designed for vectorizing categorical features with high cardinality) outputs sparse matrices by default. You are running out of memory because you request the dense representation by calling fit_transform().toarray().

Just use:

quote_data = DV.fit_transform(quote_data)
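
For completeness, here is a minimal sketch of passing the sparse matrix straight into SelectKBest, assuming `quote_data` and `target` are the question's list of dicts and labels, and assuming a sparse-aware scoring function such as chi2 is acceptable (the choice of k=10 is a placeholder, not from the question):

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, chi2

DV = DictVectorizer(sparse=True)    # sparse output is the default
X = DV.fit_transform(quote_data)    # stays a scipy.sparse matrix

# chi2 accepts sparse input, so the matrix never has to be densified
selector = SelectKBest(chi2, k=10)
X_new = selector.fit_transform(X, target)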
Serendipity
2

If your data has high cardinality because it represents text, you can try a more resource-friendly vectorizer such as HashingVectorizer.
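
As a rough illustration (not from the original answer), a hashing vectorizer keeps memory bounded because it never builds a vocabulary; `documents` below is a hypothetical list of raw strings:

from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical text column with arbitrarily many distinct tokens.
documents = ["red widget", "blue widget", "green gadget"]

# n_features fixes the output dimensionality up front, so memory use does
# not grow with the number of unique values; the output stays sparse.
hv = HashingVectorizer(n_features=2**18)
X = hv.transform(documents)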

szxk
1

I was using DictVectorizer to transform categorical database entries into one-hot vectors and was continually getting this memory error. I was making the following fatal mistake: d = DictVectorizer(sparse=False). When I then called d.transform() on fields with 2000 or more categories, Python would crash. The solution that worked was to instantiate DictVectorizer with sparse=True, which by the way is the default behavior. If you are building one-hot representations of items with many categories, dense arrays are not the most efficient structure to use, and calling .toarray() in this case is very wasteful.

The purpose of a one-hot vector in matrix multiplication is to select a row or column of some matrix. This can be done more efficiently by simply using the index where the 1 occurs in the vector. This is an implicit form of multiplication that requires orders of magnitude fewer operations than the explicit multiplication.
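
To make that point concrete, here is a tiny NumPy sketch (purely illustrative, not from the answer) showing that multiplying by a one-hot vector and indexing by the position of its 1 give the same result:

import numpy as np

M = np.arange(12).reshape(3, 4)          # some matrix
one_hot = np.array([0, 1, 0])            # one-hot vector selecting row 1

row_via_matmul = np.dot(one_hot, M)      # explicit multiplication over every entry
row_via_index = M[np.argmax(one_hot)]    # implicit: just pick the indexed row

assert np.array_equal(row_via_matmul, row_via_index)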

Jt oso
1

When performing fit_transform, instead of passing the whole list of dictionaries to it, create dictionaries that keep only the first occurrence of each feature-value pair. Here is an example:

Transform dictionary:

Before

[ {A:1, B:22.1, C:Red,   D:AB12},
  {A:2, B:23.3, C:Blue,  D:AB12},
  {A:3, B:20.2, C:Green, D:AB65} ]

After

[ {A:1, B:22.1, C:Red, D:AB12},
  {C:Blue},
  {C:Green, D:AB65} ]

This saves a lot of space.
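
One way to make this concrete (a minimal sketch, not from the answer itself) is to build the deduplicated list, fit on it, and then transform the full data; variable names are placeholders, and it assumes DictVectorizer only needs to see each feature-value pair once during fit():

from sklearn.feature_extraction import DictVectorizer

records = [
    {'A': 1, 'B': 22.1, 'C': 'Red',   'D': 'AB12'},
    {'A': 2, 'B': 23.3, 'C': 'Blue',  'D': 'AB12'},
    {'A': 3, 'B': 20.2, 'C': 'Green', 'D': 'AB65'},
]

# Keep only feature-value pairs that have not been seen before.
seen = set()
reduced = []
for row in records:
    new_items = dict((k, v) for k, v in row.items() if (k, v) not in seen)
    seen.update(new_items.items())
    if new_items:
        reduced.append(new_items)

dv = DictVectorizer(sparse=True)
dv.fit(reduced)             # fit on the much smaller deduplicated list
X = dv.transform(records)   # transform the full data; stays sparse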

icm
1

@Serendipity Using the fit_transform function, I also ran into the memory error, and removing a column was not an option in my case. So I removed .toarray() and the code worked fine.

I ran two tests on a smaller dataset, with and without the .toarray() call, and in both cases it produced an identical matrix.

In short, removing .toarray() did the job!

Rens
0

In addition to the above answers, you can also try the storage-friendly LabelBinarizer to build your own custom vectorizer. Here is the code:

from sklearn.preprocessing import LabelBinarizer

def dictsToVecs(list_of_dicts):
    # Binarize each field separately and concatenate the columns row by row.
    X = []

    # Assumes every dict has the same keys as the first one.
    for key in list_of_dicts[0].keys():
        vals = [d[key] for d in list_of_dicts]

        # One LabelBinarizer per field turns its values into 0/1 columns.
        enc = LabelBinarizer()
        vals = enc.fit_transform(vals).tolist()

        if len(X) == 0:
            X = vals
        else:
            # Append this field's columns to each row built so far.
            for row, extra in zip(X, vals):
                row.extend(extra)

    return X

Further, with separate train and test data sets, it can be helpful to save the fitted binarizer instance for each dictionary field at training time, so that they can be loaded at test time and their transform() method applied.
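
A minimal sketch of that idea, with hypothetical file names and values for a single dictionary field:

import pickle
from sklearn.preprocessing import LabelBinarizer

train_vals = ['Red', 'Blue', 'Green']    # values of one field at train time
test_vals = ['Blue', 'Red']              # values of the same field at test time

enc = LabelBinarizer()
enc.fit(train_vals)

with open('binarizer_C.pkl', 'wb') as f:   # persist the fitted binarizer
    pickle.dump(enc, f)

with open('binarizer_C.pkl', 'rb') as f:   # reload it at test time
    enc = pickle.load(f)

X_test = enc.transform(test_vals)          # columns consistent with training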

Saurav--