I'm almost positive that you're encountering memory issues because `str.get_dummies` returns an array full of 1s and 0s of dtype `np.int64`. This is quite different from the behavior of `pd.get_dummies`, which returns an array of values of dtype `uint8`.
This appears to be a known issue; however, there has been no update or fix for the past year. Checking the source code for `str.get_dummies` will indeed confirm that it returns `np.int64`.
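You can verify this yourself on a toy Series (the example data here is just an illustration):

```python
import pandas as pd

# Toy Series mimicking a comma-delimited column
s = pd.Series(['a,b,c', 'a,b', 'b,d'])
dummies = s.str.get_dummies(sep=',')

# On the pandas versions I've tested, this shows int64 for every column
print(dummies.dtypes)
```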
An 8-bit integer takes up 1 byte of memory, while a 64-bit integer takes up 8 bytes. I'm hopeful that the memory problems can be avoided by finding an alternative way to one-hot encode `Col2` that ensures the output consists entirely of 8-bit integers.
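To put numbers on that 8x difference, `nbytes` shows the footprint of a million one-hot flags stored each way (the array size here is just for illustration):

```python
import numpy as np

# One million one-hot flags stored as int64 vs uint8
flags64 = np.ones(1_000_000, dtype=np.int64)
flags8 = np.ones(1_000_000, dtype=np.uint8)

print(flags64.nbytes)  # 8000000 bytes
print(flags8.nbytes)   # 1000000 bytes
```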
Here is my approach, beginning with your example:
```python
df = pd.DataFrame({'Col1': ['X', 'Y', 'X'],
                   'Col2': ['a,b,c', 'a,b', 'b,d']})
df

  Col1   Col2
0    X  a,b,c
1    Y    a,b
2    X    b,d
```
- Since `Col1` contains simple, non-delimited strings, we can easily one-hot encode it using `pd.get_dummies`:
```python
df = pd.get_dummies(df, columns=['Col1'])
df

    Col2  Col1_X  Col1_Y
0  a,b,c       1       0
1    a,b       0       1
2    b,d       1       0
```
So far so good.
```python
df['Col1_X'].values.dtype

dtype('uint8')
```
- Let's get a list of all the unique substrings contained inside the comma-delimited strings in `Col2`:
```python
vals = list(df['Col2'].str.split(',').values)
vals = [i for l in vals for i in l]
vals = sorted(set(vals))
vals

['a', 'b', 'c', 'd']
```
- Now we can loop through the list of values above and use `str.contains` to create a new column for each value. Each row of a new column will contain 1 if that column's value (such as `'a'`) appears inside the row's `Col2` string. As we create each new column, we make sure to convert its dtype to `uint8`:
```python
col = 'Col2'
for v in vals:
    n = col + '_' + v
    # regex=False treats v as a literal string rather than a regular expression.
    # Caveat: this assumes no value is a substring of another (e.g. 'a' vs 'ab'),
    # otherwise str.contains would over-match.
    df[n] = df[col].str.contains(v, regex=False)
    df[n] = df[n].astype('uint8')
df.drop(col, axis=1, inplace=True)
df

   Col1_X  Col1_Y  Col2_a  Col2_b  Col2_c  Col2_d
0       1       0       1       1       1       0
1       0       1       1       1       0       0
2       1       0       0       1       0       1
```
This results in a dataframe that meets your desired format. And thankfully, the integers in the four new columns one-hot encoded from `Col2` take up only 1 byte each, as opposed to 8 bytes each.
```python
df['Col2_a'].dtype

dtype('uint8')
```
If, on the off chance, the above approach doesn't work, my advice would be to use `str.get_dummies` to one-hot encode `Col2` in chunks of rows. After encoding each chunk, you would convert its dtype from `np.int64` to `uint8` and then transform the chunk into a sparse matrix. You could eventually concatenate all the chunks back together.