I'm almost positive that you're encountering memory issues because `str.get_dummies` returns an array full of 1s and 0s of dtype `np.int64`. This is quite different from the behavior of `pd.get_dummies`, which returns an array of values of dtype `uint8`.
This appears to be a known issue; however, there has been no update or fix for the past year. Checking the source code for `str.get_dummies` will indeed confirm that it returns `np.int64`.
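You can verify this yourself on a toy Series (the example data here is just an illustration):

```python
import pandas as pd

# Toy Series mimicking a comma-delimited column
s = pd.Series(['a,b,c', 'a,b', 'b,d'])
dummies = s.str.get_dummies(sep=',')

# On the pandas versions I've tested, this shows int64 for every column
print(dummies.dtypes)
```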
An 8-bit integer takes up 1 byte of memory, while a 64-bit integer takes up 8 bytes. I'm hopeful that the memory problems can be avoided by finding an alternative way to one-hot encode `Col2` that ensures the output consists entirely of 8-bit integers.
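To put numbers on that 8x difference, `nbytes` shows the footprint of a million one-hot flags stored each way (the array size here is just for illustration):

```python
import numpy as np

# One million one-hot flags stored as int64 vs uint8
flags64 = np.ones(1_000_000, dtype=np.int64)
flags8 = np.ones(1_000_000, dtype=np.uint8)

print(flags64.nbytes)  # 8000000 bytes
print(flags8.nbytes)   # 1000000 bytes
```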
Here is my approach, beginning with your example:
```python
df = pd.DataFrame({'Col1': ['X', 'Y', 'X'],
                   'Col2': ['a,b,c', 'a,b', 'b,d']})
df

  Col1   Col2
0    X  a,b,c
1    Y    a,b
2    X    b,d
```
- Since `Col1` contains simple, non-delimited strings, we can easily one-hot encode it using `pd.get_dummies`:
```python
df = pd.get_dummies(df, columns=['Col1'])
df

    Col2  Col1_X  Col1_Y
0  a,b,c       1       0
1    a,b       0       1
2    b,d       1       0
```
So far so good.
```python
df['Col1_X'].values.dtype

dtype('uint8')
```
- Let's get a list of all the unique substrings contained inside the comma-delimited strings in `Col2`:
```python
vals = list(df['Col2'].str.split(',').values)
vals = [i for l in vals for i in l]
vals = sorted(set(vals))
vals

['a', 'b', 'c', 'd']
```
- Now we can loop through the list of values above and use `str.contains` to create a new column for each value. Each row of a new column will contain 1 if that column's value (such as `'a'`) appears inside the row's `Col2` string. As we create each new column, we make sure to convert its dtype to `uint8`:
```python
col = 'Col2'
for v in vals:
    n = col + '_' + v
    # regex=False treats v as a literal string rather than a regular expression.
    # Caveat: this assumes no value is a substring of another (e.g. 'a' vs 'ab'),
    # otherwise str.contains would over-match.
    df[n] = df[col].str.contains(v, regex=False)
    df[n] = df[n].astype('uint8')
df.drop(col, axis=1, inplace=True)
df

   Col1_X  Col1_Y  Col2_a  Col2_b  Col2_c  Col2_d
0       1       0       1       1       1       0
1       0       1       1       1       0       0
2       1       0       0       1       0       1
```
This results in a dataframe that meets your desired format. And thankfully, the integers in the four new columns one-hot encoded from `Col2` take up only 1 byte each, as opposed to 8 bytes each.
```python
df['Col2_a'].dtype

dtype('uint8')
```
If, on the off chance, the above approach doesn't work, my advice would be to use `str.get_dummies` to one-hot encode `Col2` in chunks of rows. After encoding each chunk, you would convert its dtype from `np.int64` to `uint8` and then transform the chunk into a sparse matrix. You could eventually concatenate all the chunks back together.