factorizing on a slice of a df

Question

I'm trying to give strings numerical representations, so I'm using pandas' factorize.

For example: Toyota = 1, Safeway = 2, Starbucks = 3.

Currently it looks like (and this works):

# Create simple unique IDs for subscription names, i.e. 1, 2, 3, 4, 5, ...
df['SUBS_GROUP_ID'] = pd.factorize(df['SUBSCRIPTION_NAME'])[0] + 1

However, I only want to factorize subscription names where SUBS_GROUP_ID is null. So my thought was to grab all the null rows and then run factorize on them.

mask_to_grab_nulls = df['SUBS_GROUP_ID'].isnull()

df[mask_to_grab_nulls]['SUBS_GROUP_ID'] =  pd.factorize(df[mask_to_grab_nulls]['SUBSCRIPTION_NAME'])[0] + 1

This runs, but does not change any values... any ideas on how to solve this?

You should always share a subset of your data set so that others can use it to help you. — Anoushiravan R, Aug 28 '22 at 21:48
Just do `df.loc[mask_to_grab_nulls, 'SUBS_GROUP_ID'] = pd.factorize(...`. — Timus, Aug 28 '22 at 21:56
@Timus this is exactly what I was looking for, thank you! Feel free to submit it as an answer and I can accept it — mikelowry, Aug 29 '22 at 03:44

score 2 · Answer 1 · answered Aug 28 '22 at 21:34

This is likely related to chained assignments (see more here). Try the solution below, which isn't optimal but should work fine in your case:

df2 = df[df['SUBS_GROUP_ID'].isnull()].copy()  # isolate the rows with null IDs; copy so the write does not target a slice
df2['SUBS_GROUP_ID'] = pd.factorize(df2['SUBSCRIPTION_NAME'])[0] + 1  # factorize
df = df.dropna(subset=['SUBS_GROUP_ID'])  # drop only the null-ID rows from the original table
df_fin = pd.concat([df, df2])  # concat df and df2

score 1 · Answer 2 · answered Aug 28 '22 at 21:33

You can use scikit-learn's LabelEncoder.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df = df.dropna(subset=['SUBS_GROUP_ID'])  # drop rows with a null SUBS_GROUP_ID
df_results = le.fit_transform(df['SUBS_GROUP_ID'].values)  # encode the strings as integer classes
df_results

Answered by hilo.

score 1 · Answer 3 · answered Aug 28 '22 at 21:34

I would use numpy.where to factorize only the non-NaN values.

import pandas as pd
import numpy as np

df = pd.DataFrame({'SUBS_GROUP_ID': ['ID-001', 'ID-002', np.nan, 'ID-004', 'ID-005'],
                   'SUBSCRIPTION_NAME': ['Toyota', 'Safeway', 'Starbucks', 'Safeway', 'Toyota']})
                   
df['SUBS_GROUP_ID'] = np.where(~df['SUBS_GROUP_ID'].isnull(), pd.factorize(df['SUBSCRIPTION_NAME'])[0] + 1, np.nan)

`>>> print(df)`

Timus · Accepted Answer · 2022-08-29T10:47:47.230

What you are doing is called chained indexing, which has two major downsides and should be avoided:

It can be slower than the alternative, because it involves more function calls.
The result is unpredictable; see "Why does assignment fail when using chained indexing?" in the pandas documentation.
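To make the second point concrete, here is a minimal sketch of the chained form from the question. The sample data is made up for illustration, and the warning pandas emits is silenced only to keep the output short. The key step is that `df[mask]` builds a new DataFrame, so the assignment lands on that temporary object rather than on `df`.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical sample data: one row already has an ID, two do not.
df = pd.DataFrame({
    "SUBS_GROUP_ID": [5.0, np.nan, np.nan],
    "SUBSCRIPTION_NAME": ["Toyota", "Safeway", "Toyota"],
})
mask = df["SUBS_GROUP_ID"].isnull()

with warnings.catch_warnings():
    # pandas normally warns about this pattern; silenced here for brevity.
    warnings.simplefilter("ignore")
    # Chained form: df[mask] creates a new object, then the assignment
    # writes into that temporary object instead of into df.
    df[mask]["SUBS_GROUP_ID"] = pd.factorize(df[mask]["SUBSCRIPTION_NAME"])[0] + 1

# The rows that were null in df are still null.
print(df["SUBS_GROUP_ID"].isnull().tolist())
```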

I'm a bit surprised you haven't seen a SettingWithCopyWarning. The warning points you in the right direction:

... Try using .loc[row_indexer,col_indexer] = value instead

So this should work:

mask_to_grab_nulls = df['SUBS_GROUP_ID'].isnull()
df.loc[mask_to_grab_nulls, 'SUBS_GROUP_ID'] = pd.factorize(
    df.loc[mask_to_grab_nulls, 'SUBSCRIPTION_NAME']
)[0] + 1
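For anyone who wants to try it, here is a self-contained sketch of the same approach. The sample values are made up for illustration (the question did not include data), and existing IDs are assumed to be numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two rows already have an ID, three do not.
df = pd.DataFrame({
    "SUBS_GROUP_ID": [10.0, np.nan, np.nan, 20.0, np.nan],
    "SUBSCRIPTION_NAME": ["Toyota", "Safeway", "Starbucks", "Toyota", "Safeway"],
})

mask_to_grab_nulls = df["SUBS_GROUP_ID"].isnull()

# factorize returns (codes, uniques); codes start at 0, hence the + 1 below.
codes, uniques = pd.factorize(df.loc[mask_to_grab_nulls, "SUBSCRIPTION_NAME"])

# A single .loc call selects rows and column together, so the write goes to df itself.
df.loc[mask_to_grab_nulls, "SUBS_GROUP_ID"] = codes + 1

print(df)
```

Note that the new codes are numbered from 1 within the null subset only, so they are not guaranteed to differ from IDs that already exist in the column.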
