I'm unsure if this is already the fastest possible method, or if I'm doing this inefficiently.

I want to one-hot encode a particular categorical column which has 27k+ possible levels. The column has different values in the two datasets, so I combined the levels first before using get_dummies():

import pandas as pd

def hot_encode_column_in_both_datasets(column_name, df, df2, sparse=True):
    # build the union of levels seen in either dataframe
    col1b = set(df2[column_name].unique())
    col1a = set(df[column_name].unique())
    combined_cats = list(col1a.union(col1b))
    # give both columns an identical category set so the dummy
    # columns line up between the two dataframes
    df[column_name] = df[column_name].astype('category', categories=combined_cats)
    df2[column_name] = df2[column_name].astype('category', categories=combined_cats)

    df = pd.get_dummies(df, columns=[column_name], sparse=sparse)
    df2 = pd.get_dummies(df2, columns=[column_name], sparse=sparse)
    try:
        del df[column_name]
        del df2[column_name]
    except:
        pass
    return df, df2

However, it's been running for more than 2 hours and is still stuck on the encoding step.

Could I be doing something wrong here, or is this just the nature of running it on large datasets?

df has 6.8M rows and 27 columns, and df2 has 19,990 rows and 27 columns, before one-hot encoding the column in question.

Advice appreciated, thank you! :)

Wboy
  • `except: pass` is always wrong. I think you want `if column_name in df:` instead. As for the rest of your question, why don't you tell us which particular line is taking a long time? – John Zwinck May 28 '17 at 14:59
  • @JohnZwinck Thank you for your input :) In this case I don't think it really matters though, please correct me if I'm wrong. – Wboy May 28 '17 at 15:00
  • @JohnZwinck As I mentioned, it's the get_dummies() call that is taking a long time – Wboy May 28 '17 at 15:03
  • 1
    IMO `CountVectorizer` is an optimal choice for this task. I could write a small demo if you would provide small reproducible data sets and desired data set... – MaxU - stand with Ukraine May 29 '17 at 16:08
  • @MaxU I would be curious to see how to use `CountVectorizer` on numeric data – user6903745 Apr 09 '18 at 10:07
  • Did you try OneHotEncoder? There's an argument `handle_unknown` that you can set to `'ignore'` in case values are present in `df` and not in `df2` (a minimal sketch follows these comments) – user6903745 Apr 09 '18 at 10:09
  • @user6903745, IMO it doesn't make much sense to use `CountVectorizer` for numeric data, `OneHotEncoder` would be better in this case... – MaxU - stand with Ukraine Apr 09 '18 at 11:03
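
For reference, a minimal sketch of the `OneHotEncoder` suggestion from the comments, assuming a scikit-learn version (0.20+) whose encoder accepts string input directly; the frames and column name here are made up for illustration. Fitting on one frame and transforming both guarantees the outputs have the same number of columns, and values unseen at fit time encode to all-zero rows:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical stand-ins for df and df2
df = pd.DataFrame({'cat_col': ['a', 'b', 'c', 'a']})
df2 = pd.DataFrame({'cat_col': ['b', 'd']})

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(df[['cat_col']])              # learn the levels from the first frame only
X1 = enc.transform(df[['cat_col']])   # scipy sparse matrix, one column per level
X2 = enc.transform(df2[['cat_col']])  # 'd' was never seen, so its row is all zeros
assert X1.shape[1] == X2.shape[1]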

1 Answer

I reviewed the get_dummies source code briefly, and I think it may not be taking full advantage of the sparsity for your use case. The following approach may be faster, but I did not attempt to scale it all the way up to the ~6.8M records you have:

import numpy as np
import pandas as pd
import scipy.sparse as ssp

np.random.seed(1)
N = 10000

dfa = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N)
    , 'col2b': np.random.choice([1, 2, 3], N)
    , 'target': np.random.choice([1, 2, 3], N)
    })

# construct an array of the unique values of the column to be encoded
vals = np.array(dfa.col1.unique())
# extract an array of values to be encoded from the dataframe
col1 = dfa.col1.values
# construct a sparse matrix of the appropriate size and an appropriate,
# memory-efficient dtype
spmtx = ssp.dok_matrix((N, len(vals)), dtype=np.uint8)
# do the encoding. NB: This is only vectorized in one of the two dimensions.
# Finding a way to vectorize the second dimension may yield a large speed up
for idx, val in enumerate(vals):
    spmtx[np.argwhere(col1 == val), idx] = 1

# Construct a SparseDataFrame from the sparse matrix and apply the index
# from the original dataframe and column names.
dfnew = pd.SparseDataFrame(spmtx, index=dfa.index,
                           columns=['col1_' + str(el) for el in vals])
dfnew.fillna(0, inplace=True)
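
A quick sanity check on the demo output (every row should contain exactly one 1, since every value of col1 appears in vals):

# each row of the encoding should have exactly one nonzero entry
assert (np.asarray(spmtx.sum(axis=1)).ravel() == 1).all()
assert dfnew.shape == (N, len(vals))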

UPDATE

Borrowing insights from a couple of other answers, I was able to vectorize the solution in both dimensions. In my limited testing, constructing the SparseDataFrame seems to increase the execution time several fold, so if you don't need to return a DataFrame-like object, you can save a lot of time. This solution also handles the case where you need to encode two or more DataFrames into 2-D arrays with equal numbers of columns.

import numpy as np
import pandas as pd
import scipy.sparse as ssp

np.random.seed(1)
N1 = 10000
N2 = 100000

dfa = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N1)
    , 'col2a': np.random.choice([1, 2, 3], N1)
    , 'target': np.random.choice([1, 2, 3], N1)
    })

dfb = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N2)
    , 'col2b': np.random.choice(['foo', 'bar', 'baz'], N2)
    , 'target': np.random.choice([1, 2, 3], N2)
    })

# construct an array of the unique values of the column to be encoded
# taking the union of the values from both dataframes.
valsa = set(dfa.col1.unique())
valsb = set(dfb.col1.unique())
vals = np.array(list(valsa.union(valsb)), dtype=np.uint16)


def sparse_ohe(df, col, vals):
    """One-hot encoder using a sparse ndarray."""
    colarray = df[col].values
    # construct a sparse matrix of the appropriate size and an appropriate,
    # memory-efficient dtype
    spmtx = ssp.dok_matrix((df.shape[0], vals.shape[0]), dtype=np.uint8)
    # do the encoding: broadcast the column against the full set of values
    spmtx[np.where(colarray.reshape(-1, 1) == vals.reshape(1, -1))] = 1

    # Construct a SparseDataFrame from the sparse matrix
    dfnew = pd.SparseDataFrame(spmtx, dtype=np.uint8, index=df.index,
                               columns=[col + '_' + str(el) for el in vals])
    dfnew.fillna(0, inplace=True)
    return dfnew

dfanew = sparse_ohe(dfa, 'col1', vals)
dfbnew = sparse_ohe(dfb, 'col1', vals)
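
If even the vectorized comparison proves too slow or memory-hungry at full scale (it materializes a dense rows-by-levels boolean array), the same encoding can be built directly as a CSR matrix by mapping each value to its column index. A minimal sketch along the same lines, using np.searchsorted; note that the columns come out in sorted order of vals here, unlike in sparse_ohe:

def sparse_ohe_csr(df, col, vals):
    """One-hot encode df[col] into a scipy CSR matrix.

    Assumes every value in df[col] occurs somewhere in vals.
    """
    vals = np.sort(vals)                          # searchsorted needs sorted input
    rows = np.arange(df.shape[0])
    cols = np.searchsorted(vals, df[col].values)  # column index of each value
    data = np.ones(df.shape[0], dtype=np.uint8)
    return ssp.csr_matrix((data, (rows, cols)),
                          shape=(df.shape[0], vals.shape[0]))

mtxa = sparse_ohe_csr(dfa, 'col1', vals)
mtxb = sparse_ohe_csr(dfb, 'col1', vals)
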
blueogive
  • Hey, thanks for your answer! :) How would this deal with accounting for the categories in the second dataframe as well? – Wboy May 29 '17 at 04:02
  • Hello again! :) If I understand this correctly, this only returns the current column as a sparse dataframe instead of combining it into the original dataframe, right? Also, I get a `ValueError: unable to coerce current fill_value nan to uint8 dtype` when trying it out – Wboy May 29 '17 at 16:49
  • Hmm, I can't reproduce the ValueError. I'm using pandas 0.20.1, which was only recently released. If you need to reassemble a complete dataframe that includes all the columns of the original, less the one-hot-encoded column, then you can add this statement at the end: `dfa = pd.concat([dfanew, dfa.drop('col1', axis=1)], axis=1)`. – blueogive May 29 '17 at 17:09
  • If the ValueError persists, I suspect that removing the `dtype=np.uint8` argument from the `SparseDataFrame` call will resolve it. That argument is not strictly necessary. – blueogive May 29 '17 at 17:12
  • `np.nan` is an `np.float` object. – hpaulj May 30 '17 at 20:42