9

I have a pandas dataframe similar to this:

  Col1   ABC
0  XYZ    A
1  XYZ    B
2  XYZ    C

By using the pandas get_dummies() function on column ABC, I can get this:

  Col1   A   B   C
0  XYZ   1   0   0
1  XYZ   0   1   0
2  XYZ   0   0   1

But I need something like this, where the ABC column has a list / array datatype:

  Col1    ABC
0  XYZ    [1,0,0]
1  XYZ    [0,1,0]
2  XYZ    [0,0,1]

I tried using get_dummies and then combining the resulting columns into the single column I wanted. I found a lot of answers explaining how to combine multiple columns as strings, such as "Combine two columns of text in dataframe in pandas/python", but I cannot figure out how to combine them into a list.

The question "How do I one-hot encode one column of a pandas dataframe?" introduced the idea of using sklearn's OneHotEncoder, but I couldn't get it to work.

One more thing: all the answers I came across required typing the column names manually when combining them. Is there a way to use DataFrame.iloc or some other slicing mechanism to combine columns into a list?

Nir_J
  • "where the ABC column has a list / array datatype:" why? – juanpa.arrivillaga Nov 05 '17 at 22:39
  • Possible duplicate of [Combine columns in a Pandas DataFrame to a column of lists in a DataFrame](https://stackoverflow.com/questions/27145148/combine-columns-in-a-pandas-dataframe-to-a-column-of-lists-in-a-dataframe) – andrew_reece Nov 05 '17 at 23:24

4 Answers

8

Here is an example of using sklearn.preprocessing.LabelBinarizer:

In [361]: from sklearn.preprocessing import LabelBinarizer

In [362]: lb = LabelBinarizer()

In [363]: df['new'] = lb.fit_transform(df['ABC']).tolist()

In [364]: df
Out[364]:
  Col1 ABC        new
0  XYZ   A  [1, 0, 0]
1  XYZ   B  [0, 1, 0]
2  XYZ   C  [0, 0, 1]

Pandas alternative:

In [370]: df['new'] = df['ABC'].str.get_dummies().values.tolist()

In [371]: df
Out[371]:
  Col1 ABC        new
0  XYZ   A  [1, 0, 0]
1  XYZ   B  [0, 1, 0]
2  XYZ   C  [0, 0, 1]
MaxU - stand with Ukraine
  • After getting the column of lists, I was able to convert each list to an array using `df['new'].apply(lambda x: np.array(x))`. Is there a direct way to get arrays? – Nir_J Nov 05 '17 at 23:24
  • @Nir_J, I don't know how to assign a 2D NumPy array to a single pandas column directly. Pandas will think that we are assigning multiple columns... That's actually why I used `.tolist()` – MaxU - stand with Ukraine Nov 05 '17 at 23:29
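
On the question raised in the comments above, one possible way to get a column whose cells are NumPy arrays is to wrap the 2-D array in `list(...)`, so that pandas receives one 1-D array per row. This is only a sketch: the sample frame is rebuilt from the question, and `dtype=int` is an assumption made so that the values are 1/0 regardless of pandas version.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question
df = pd.DataFrame({"Col1": ["XYZ"] * 3, "ABC": ["A", "B", "C"]})

# 2-D array of shape (3, 3); dtype=int is an assumption (see lead-in)
onehot = pd.get_dummies(df["ABC"], dtype=int).to_numpy()

# list(onehot) yields one 1-D array per row, so pandas stores each array
# as a single object cell instead of treating the data as three columns
df["new"] = list(onehot)

print(df)
print(type(df["new"].iloc[0]))
```

The resulting column still has dtype object, so it carries the same performance caveats as a column of lists.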
5

You can just use tolist():

df['ABC'] = pd.get_dummies(df.ABC).values.tolist()

  Col1        ABC
0  XYZ  [1, 0, 0]
1  XYZ  [0, 1, 0]
2  XYZ  [0, 0, 1]
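
Since pandas 2.0, get_dummies returns boolean columns by default, so on those versions the one-liner above would produce lists of True/False rather than 1/0. A minimal self-contained sketch, assuming the sample frame from the question and passing `dtype=int` explicitly:

```python
import pandas as pd

# Sample frame rebuilt from the question
df = pd.DataFrame({"Col1": ["XYZ"] * 3, "ABC": ["A", "B", "C"]})

# dtype=int forces 1/0 output; without it, pandas 2.0+ yields booleans
df["ABC"] = pd.get_dummies(df["ABC"], dtype=int).to_numpy().tolist()

print(df)
```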
andrew_reece
2

If you have a pd.DataFrame like this:

>>> df
  Col1  A  B  C
0  XYZ  1  0  0
1  XYZ  0  1  0
2  XYZ  0  0  1

You can always do something like this:

>>> df.apply(lambda s: list(s[1:]), axis=1)
0    [1, 0, 0]
1    [0, 1, 0]
2    [0, 0, 1]
dtype: object

Note that this is essentially a for-loop over the rows. Also, a column cannot have a list dtype: a column of lists has dtype object, so operations on it cannot take advantage of the speed benefits of numpy.
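
If the row-wise loop is a concern, the same result can probably be obtained by slicing the columns by position with `iloc`, which also avoids typing the column names. This is a sketch under the assumption that the dummy columns are everything after the first column, as in the frame shown above:

```python
import pandas as pd

# Frame shaped like the get_dummies output shown above
df = pd.DataFrame({"Col1": ["XYZ"] * 3,
                   "A": [1, 0, 0], "B": [0, 1, 0], "C": [0, 0, 1]})

# iloc[:, 1:] selects every column after the first by position,
# then the 2-D values are turned into one list per row
df["ABC"] = df.iloc[:, 1:].to_numpy().tolist()

result = df[["Col1", "ABC"]]
print(result)
```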

juanpa.arrivillaga
  • Thank you for pointing out the disadvantage of lists. I wanted to be able to use just that one column as the label to train the model. Will this solution be able to benefit from the speed of numpy? – Nir_J Nov 05 '17 at 22:56
  • @Nir_J no. I'm not sure models in `sklearn` will accept a column of `list` objects anyway. – juanpa.arrivillaga Nov 05 '17 at 22:59
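
If the labels are needed as a regular numeric array for training, one option is to rebuild a 2-D array from the column of lists. A minimal sketch, assuming a column named `new` that holds equal-length lists, as in the answers above:

```python
import numpy as np
import pandas as pd

# Column of equal-length lists, as produced by the answers above
df = pd.DataFrame({"Col1": ["XYZ"] * 3,
                   "new": [[1, 0, 0], [0, 1, 0], [0, 0, 1]]})

# tolist() gives a list of lists, which np.array turns into a
# regular 2-D numeric array instead of an array of Python objects
y = np.array(df["new"].tolist())

print(y.shape)
```

Whether a particular model accepts labels in this shape depends on the estimator, so this only covers the conversion step.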
0

If you have a DataFrame df with a categorical column ABC, you can create a new column of one-hot vectors like this:

import pandas

# each row of the dummy matrix becomes one array in the new column
df['new_column'] = list(pandas.get_dummies(df['ABC']).values)
Spandyie