
I have a Pandas DataFrame with a column whose values are lists. I would like to apply scikit-learn's OneHotEncoder to this column, but I keep getting a ValueError.

My problem can be reproduced as follows:

import pandas as pd
import numpy as np

d = {'A': [[5,7], [3, 4, 5], [2], [1,2,3,4]]}
df = pd.DataFrame(data=d)
df
      A
0   [5, 7]
1   [3, 4, 5]
2   [2]
3   [1, 2, 3, 4]

a = np.array(df['A'])
a
array([list([5, 7]), list([3, 4, 5]), list([2]), list([1, 2, 3, 4])],
      dtype=object)

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse = False)

X = enc.fit_transform(a)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-47-64181a9f7331> in <module>()
----> 1 X = enc.fit_transform(a)

~\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in fit_transform(self, X, y)
   2017         """
   2018         return _transform_selected(X, self._fit_transform,
-> 2019                                    self.categorical_features, copy=True)
   2020 
   2021     def _transform(self, X):

~\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in _transform_selected(X, transform, selected, copy)
   1807     X : array or sparse matrix, shape=(n_samples, n_features_new)
   1808     """
-> 1809     X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
   1810 
   1811     if isinstance(selected, six.string_types) and selected == "all":

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    431                                       force_all_finite)
    432     else:
--> 433         array = np.array(array, dtype=dtype, order=order, copy=copy)
    434 
    435         if ensure_2d:

ValueError: setting an array element with a sequence.

I am using Windows 10, Python 3.6.4, and scikit-learn 0.19.1.

Many thanks for any ideas anyone has!

Michael

1 Answer


For a column of lists, you should use MultiLabelBinarizer from sklearn. OneHotEncoder expects a 2-D array of scalar values (one category per cell), so it cannot convert an object array of variable-length lists; that is why check_array raises "setting an array element with a sequence".

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
print(pd.DataFrame(mlb.fit_transform(df['A']), columns=mlb.classes_, index=df.index))
   1  2  3  4  5  7
0  0  0  0  0  1  1
1  0  0  1  1  1  0
2  0  1  0  0  0  0
3  1  1  1  1  0  0
BENY
  • Thank you for the answer. I have tried the mlb solution, but my actual dataset has 1 million examples and I get a memory error, so I wanted to look for other solutions. Also, I do want to know what I am doing wrong with OneHotEncoder. – Michael Apr 25 '18 at 21:01
  • @Michael then just try it chunk by chunk to finish the process (a sketch of this is below) – BENY Apr 25 '18 at 21:05
  • Are you able to give me an example of code to implement this please? – Michael Apr 25 '18 at 21:55
  • @Michael something like this https://stackoverflow.com/questions/33542977/pandas-groupby-with-sum-on-large-csv-file – BENY Apr 25 '18 at 21:57
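
A minimal sketch of the chunk-by-chunk idea from the comments, assuming the large column is still df['A'] and that a sparse result is acceptable; chunk_size is an illustrative value, not something taken from the thread:

import pandas as pd
from scipy import sparse
from sklearn.preprocessing import MultiLabelBinarizer

# Fit once on the whole column so every chunk is encoded against the
# same set of classes; sparse_output keeps the result memory-friendly.
mlb = MultiLabelBinarizer(sparse_output=True)
mlb.fit(df['A'])

chunk_size = 100000  # illustrative; tune to the available memory
parts = [mlb.transform(df['A'].iloc[start:start + chunk_size])
         for start in range(0, len(df), chunk_size)]
X = sparse.vstack(parts)  # scipy sparse matrix, shape (n_rows, n_classes)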