I have this data

[image: screenshot of the data]

I am trying to apply this:

one_hot = pd.get_dummies(df)

But I get this error:

[image: screenshot of the error]

Here is my code up until then:

# Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
df = pd.read_csv('AllMSAData.csv')
df.head()
corr_matrix = df.corr()
corr_matrix
df.describe()
# Get features and targets
labels = np.array(df['CurAV'])
# Remove the labels from the features
# axis 1 refers to the columns
df = df.drop('CurAV', axis = 1)
# Saving feature names for later use
feature_list = list(df.columns)
# Convert to numpy array
df = np.array(df)

1 Answer

IMO, the documentation should be updated, because it says pd.get_dummies accepts data that is array-like, and a 2-D numpy array is array-like (even though there is no formal definition of array-like). However, it does not seem to accept multi-dimensional arrays.

Take this tiny example:

>>> df
   a  b  c
0  a  1  d
1  b  2  e
2  c  3  f

You can't get dummies on the underlying 2D numpy array:

>>> pd.get_dummies(df.values)

Exception: Data must be 1-dimensional

But you can get dummies on the dataframe itself:

>>> pd.get_dummies(df)
   b  a_a  a_b  a_c  c_d  c_e  c_f
0  1    1    0    0    1    0    0
1  2    0    1    0    0    1    0
2  3    0    0    1    0    0    1

Or on the 1D array underlying an individual column:

>>> pd.get_dummies(df['a'].values)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
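
Applied to the code in the question, this suggests calling get_dummies while the data is still a DataFrame and converting to a numpy array only afterwards. A minimal sketch, assuming a small made-up frame in place of AllMSAData.csv (the column names `State`, `Prev_CS_Tier` and `CurAV` come from this thread; the values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for AllMSAData.csv; the values are made up.
df = pd.DataFrame({
    "State": ["TX", "CA", "TX"],
    "Prev_CS_Tier": ["A", "B", "A"],
    "CurAV": [100.0, 250.0, 175.0],
})

# Separate the target while df is still a DataFrame.
labels = np.array(df["CurAV"])
features = df.drop("CurAV", axis=1)

# Encode before converting: object columns are replaced by dummy columns.
features = pd.get_dummies(features)

# Save the (now expanded) feature names for later use.
feature_list = list(features.columns)

# Only now convert to a numpy array.
X = np.array(features)
```

Note that `feature_list` is built after encoding, so it matches the columns of `X`.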
sacuL
  • What would you recommend for my case then? –  Dec 01 '18 at 00:52
  • I noticed that when I call pd.get_dummies(df) before the features and targets part I do not get an error but then it does nothing to the data –  Dec 01 '18 at 00:53
  • use `pd.get_dummies(df[['columns', 'to', 'dummify']])` – sacuL Dec 01 '18 at 00:53
  • KeyError: "['columns' 'to' 'dummify'] not in index" –  Dec 01 '18 at 00:54
  • That was meant as a placeholder, replace columns to dummify with the columns you want. For example, if you want to get dummies for `State` and `Prev_CS_Tier`, use `pd.get_dummies(df[['State', 'Prev_CS_Tier']])` – sacuL Dec 01 '18 at 00:57
  • So I have to specify each column I want to get dummies for? –  Dec 01 '18 at 00:58
  • That would work, or you could use the `columns` argument. Take a look at the [docs](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) for that argument, it explains it better than I can... but if you just do `pd.get_dummies(df)` *before* you transform `df` into a numpy array, it will just convert all `object` and `category` columns to dummies, which *might* be what you're looking for (but you should think about your data; for instance, I would personally convert `Prev_CS_Tier` to an ordinal, rather than a dummy) – sacuL Dec 01 '18 at 01:06
  • Still not working but no worries, I will figure it out –  Dec 01 '18 at 01:08
  • I just want to change all my categorical data and nothing to the other numerical data variables but this is still doing nothing –  Dec 01 '18 at 01:15
  • You'll need to concatenate the result into your original dataframe. `get_dummies` is not done in place, it returns its own dataframe. – sacuL Dec 01 '18 at 01:16
  • Okay, should I use get_dummies() before or after I convert my data into an array? –  Dec 01 '18 at 01:17
  • Before. See earlier comments and the answer I posted – sacuL Dec 01 '18 at 01:18
  • Okay. After I create my dummies, and before I split my data into training and testing sets, should I drop all my categorical variables? If I do not, I get errors. –  Dec 01 '18 at 01:34
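
The options discussed in the comments can be sketched roughly as follows. This is only an illustration on made-up data: `Population` is a hypothetical numeric column, and the tier labels and their order are invented. With the `columns` argument, get_dummies returns a new frame in which the listed columns are already replaced by their dummies, so nothing has to be dropped by hand. With the subset-and-concatenate approach, the original categorical columns need to be dropped explicitly.

```python
import pandas as pd

# Made-up data; "Population" is a hypothetical numeric column.
df = pd.DataFrame({
    "State": ["TX", "CA", "TX"],
    "Prev_CS_Tier": ["A", "B", "A"],
    "Population": [10, 20, 30],
})
cat_cols = ["State", "Prev_CS_Tier"]

# Option 1: the columns argument. The result is a new DataFrame;
# df itself is not modified, so the return value must be assigned.
encoded = pd.get_dummies(df, columns=cat_cols)

# Option 2: encode a subset, then concatenate it with the remaining
# columns, dropping the original categorical columns by hand.
dummies = pd.get_dummies(df[cat_cols])
combined = pd.concat([df.drop(cat_cols, axis=1), dummies], axis=1)

# Ordinal alternative for a tier-like column, assuming the
# (invented) order A < B: each label maps to its integer rank.
tier_codes = pd.Categorical(
    df["Prev_CS_Tier"], categories=["A", "B"], ordered=True
).codes
```

Either `encoded` or `combined` contains only numeric and dummy columns, so it can be converted to a numpy array or split into training and testing sets from there.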