
I want to convert something like this:

['dog', 'cat', 'fish', 'dog', 'dog', 'bird', 'cat', 'bird']

Into a boolean matrix, one column in the matrix for each classification. For this example, it'd be like this:

(dog) (cat) (fish) (bird)
  1     0      0     0
  0     1      0     0
  0     0      1     0
  1     0      0     0
  1     0      0     0
  0     0      0     1 
  0     1      0     0
  0     0      0     1  

Each row has a 1 (true) in the column that matches that item's classification. I know I could do this iteratively, like this:

import numpy as np

labels = ['dog', 'cat', 'fish', 'dog', 'dog', 'bird', 'cat', 'bird']  # array of classifications
# one row per item, one column per distinct classification
new = np.zeros((len(labels), 4), dtype=int)
for i, c in enumerate(labels):
    if c == 'dog':
        new[i][0] = 1
    elif c == 'cat':
        new[i][1] = 1
    # and so on

I feel there's a more efficient way of doing this within numpy or pandas. I originally have the data as a DataFrame and then convert it to a numpy array, so I wouldn't mind a pandas solution.

Mauricio Martinez
    Possible duplicate: [How can I one hot encode in Python?](https://stackoverflow.com/q/37292872/190597) – unutbu Feb 27 '18 at 16:54

1 Answer


Use pd.get_dummies, which also accepts a list:

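import pandas as pd

a = ['dog', 'cat', 'fish', 'dog', 'dog', 'bird', 'cat', 'bird']
# dtype=int gives the 0/1 output shown; pandas 2.0+ returns True/False by default
df = pd.get_dummies(a, dtype=int)
print(df)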
   bird  cat  dog  fish
0     0    0    1     0
1     0    1    0     0
2     0    0    0     1
3     0    0    1     0
4     0    0    1     0
5     1    0    0     0
6     0    1    0     0
7     1    0    0     0
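Since the question starts from a DataFrame and ends with a numpy array, here is a minimal sketch of that round trip. The DataFrame `data` and the column name `animal` are assumptions made for illustration; they do not come from the question.

```python
import pandas as pd

# Hypothetical input: the column name 'animal' is an assumption
data = pd.DataFrame(
    {'animal': ['dog', 'cat', 'fish', 'dog', 'dog', 'bird', 'cat', 'bird']}
)

# One boolean column per distinct value; columns come out sorted
dummies = pd.get_dummies(data['animal'], dtype=bool)

# Plain numpy boolean matrix, one row per item
matrix = dummies.to_numpy()

print(list(dummies.columns))
print(matrix.shape)
```

Keeping `dummies.columns` alongside `matrix` is useful, because the numpy array itself no longer records which column belongs to which classification.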

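If the order of the columns matters, add reindex with the unique values, which keep their order of first appearance:

# Series.unique() returns values in order of first appearance
df = pd.get_dummies(a, dtype=int).reindex(columns=pd.Series(a).unique())
print(df)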
   dog  cat  fish  bird
0    1    0     0     0
1    0    1     0     0
2    0    0     1     0
3    1    0     0     0
4    1    0     0     0
5    0    0     0     1
6    0    1     0     0
7    0    0     0     1
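The question also mentions numpy, so here is a rough numpy-only sketch of the same idea. This is not part of the original answer; it assumes the labels fit in a 1-D array, and the variable names are illustrative.

```python
import numpy as np

a = ['dog', 'cat', 'fish', 'dog', 'dog', 'bird', 'cat', 'bird']

# categories: sorted distinct labels
# codes: for each item, the index of its label within categories
categories, codes = np.unique(a, return_inverse=True)

# Broadcasting: compare each item's code with every column index,
# producing a boolean matrix of shape (n_items, n_categories)
onehot = codes[:, None] == np.arange(len(categories))

print(categories)
print(onehot.astype(int))
```

As with the plain get_dummies call, the columns follow the sorted order of the labels (bird, cat, dog, fish) rather than the order of first appearance.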
jezrael