I'm trying to write a Python function that does one-hot encoding in place, but I can't find a way to do the final concat operation in place. pd.concat returns a new DataFrame, and I am unable to assign that result back to the DataFrame I passed into the function.

How can this be done?

def one_hot_encode(df, col: str):
     """One-Hot encode inplace. Includes NAN.

     Keyword arguments:
     df (DataFrame) -- the DataFrame object to modify
     col (str) -- the column name to encode
     """

     insert_loc = df.columns.get_loc(col)
     insert_data = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True)

     df.drop(col, axis=1, inplace=True)
     df[:] = pd.concat([df.iloc[:, :insert_loc], insert_data, df.iloc[:, insert_loc:]], axis=1) # Doesn't take effect outside function
Josh K

4 Answers

To make the change take effect outside the function, we have to modify the object that was passed in rather than rebind its name (inside the function) to a new object.
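The difference between rebinding a name and modifying the object can be seen in a small sketch. The frame and the function names `rebind` and `mutate` are made up for illustration and are not part of the question's code.

```python
import pandas as pd


def rebind(df):
    # Rebinding the local name points it at a new object;
    # the caller's DataFrame is left untouched.
    df = pd.concat([df, df.add_suffix("_copy")], axis=1)


def mutate(df):
    # Item assignment changes the object itself,
    # so the caller sees the new column.
    df["b"] = df["a"] * 2


frame = pd.DataFrame({"a": [1, 2, 3]})

rebind(frame)
columns_after_rebind = list(frame.columns)   # still only "a"

mutate(frame)
columns_after_mutate = list(frame.columns)   # now "a" and "b"
```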

To assign the new columns, you can use

df[insert_data.columns] = insert_data

instead of the concat.

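That doesn't take advantage of your careful insert order though, because the new columns are appended at the end. To retain your order, we can reindex the data frame. Note that reindex returns a new DataFrame rather than modifying df, so the reordered frame has to be returned from the function:

return df.reindex(columns=cols)

where cols is the combined list of columns in order, with the dummy columns taking the place of the original column:

cols = cols[:insert_loc] + list(insert_data.columns) + cols[insert_loc + 1:]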
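One caveat worth checking when the goal is to change the caller's object: `DataFrame.reindex` hands back a new DataFrame and leaves the original column order alone. A minimal sketch on a made-up three-column frame:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# reindex builds a new object with the requested column order.
reordered = frame.reindex(columns=["c", "a", "b"])

original_order = list(frame.columns)   # the original frame is unchanged
new_order = list(reordered.columns)
same_object = reordered is frame
```

So a bare `df.reindex(columns=cols)` inside a function has no visible effect unless its result is returned or otherwise used.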

Putting it all together,

import pandas as pd

def one_hot_encode(df, col: str):
    """One-Hot encode inplace. Includes NAN.

    Keyword arguments:
    df (DataFrame) -- the DataFrame object to modify
    col (str) -- the column name to encode
    """

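    cols = list(df.columns)
    insert_loc = df.columns.get_loc(col)
    insert_data = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True)

    # Desired order: the dummy columns take the place of the original column.
    cols = cols[:insert_loc] + list(insert_data.columns) + cols[insert_loc + 1:]
    # Column assignment and the in-place drop both modify the caller's DataFrame.
    df[insert_data.columns] = insert_data
    df.drop(col, axis=1, inplace=True)
    # reindex returns a new DataFrame, so the reordered frame is returned;
    # df itself keeps the dummy columns at the end.
    return df.reindex(columns=cols)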


import seaborn

diamonds=seaborn.load_dataset("diamonds")
col="color"
one_hot_encode(diamonds, "color")

assert( "color" not in diamonds.columns ) 
assert( len([c for c in diamonds.columns if c.startswith("color")]) == 8 )

Tim

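Python does not pass arguments by reference in the sense of letting a function rebind the caller's variable (see: How do I pass a variable by reference?). Changes made to the object itself are visible to the caller, but assigning a new object to the parameter name is not.

Instead, you can return the modified df from your function and assign the result to the original df: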

def one_hot_encode(df, col: str):
    ...
    return df

...
df=one_hot_encode(df, col)
Grzegorz Skibinski

df.insert works in place, but it can only insert one column at a time. Preserving the column order this way might not be worth the effort.

import pandas as pd

def one_hot_encode2(df, col: str):
    """One-Hot encode inplace. Includes NAN.

    Keyword arguments:
    df (DataFrame) -- the DataFrame object to modify
    col (str) -- the column name to encode
    """

    insert_loc = df.columns.get_loc(col)
    insert_data = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True)

    # Insert each dummy column just before the original column, one at a time,
    # passing a Series (single brackets) as the value.
    for offset, newcol in enumerate(insert_data.columns):
        df.insert(loc=insert_loc + offset, column=newcol, value=insert_data[newcol])

    df.drop(col, axis=1, inplace=True)


import seaborn

diamonds=seaborn.load_dataset("diamonds")
col="color"
one_hot_encode2(diamonds, "color")

assert( "color" not in diamonds.columns ) 
assert(len([c for c in diamonds.columns if c.startswith("color")]) == 8)

assert([(i) for i,c in enumerate(diamonds.columns) if c.startswith("color")][0] == 2)
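The same behaviour can be checked without downloading a dataset. The sketch below uses a made-up three-column frame and a hypothetical helper, `one_hot_encode_insert`, that follows the insert-based approach above. It is meant as an illustration, not a drop-in replacement.

```python
import numpy as np
import pandas as pd


def one_hot_encode_insert(df, col):
    """Hypothetical demo helper: one-hot encode `col` in place via df.insert."""
    insert_loc = df.columns.get_loc(col)
    dummies = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True)
    # Place each dummy column just before the original column.
    for offset, newcol in enumerate(dummies.columns):
        df.insert(loc=insert_loc + offset, column=newcol, value=dummies[newcol])
    # Remove the original column from the same object.
    df.drop(col, axis=1, inplace=True)


demo = pd.DataFrame({
    "id": [1, 2, 3],
    "color": ["red", "blue", np.nan],
    "size": [10, 20, 30],
})

one_hot_encode_insert(demo, "color")   # no return value and no reassignment
result_columns = list(demo.columns)
```

After the call, `demo` itself holds the dummy columns between `id` and `size`, which is the position the original `color` column occupied.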
Tim

Variables assigned inside a function are local to it, so rebinding df there does not affect the caller. Add a return statement at the end of the function and assign the result when you call it. Also, when adding the new (dummy) columns, assign to df rather than df[:], because you are changing the shape of the original DataFrame.

def one_hot_encode(df, col: str):
    insert_loc = df.columns.get_loc(col)
    insert_data = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True) 
    # Drop without inplace=True so the caller's DataFrame is not modified as a side effect.
    df = df.drop(col, axis=1)
    df = pd.concat([df.iloc[:, :insert_loc], insert_data, df.iloc[:, insert_loc:]], axis=1)
    return df

To get the modified DataFrame, call the function and assign its result to a new or existing variable:

df=one_hot_encode(df,'<any column name>')