0

I'm trying to map strings to numeric IDs, so I'm using pandas' factorize.

For example: Toyota = 1, Safeway = 2, Starbucks = 3.

Currently it looks like (and this works):

# Create simple unique IDs for subscription names, i.e. 1, 2, 3, 4, 5, ...
df['SUBS_GROUP_ID'] = pd.factorize(df['SUBSCRIPTION_NAME'])[0] + 1

However, I only want to factorize subscription names where SUBS_GROUP_ID is null. My thought was to grab all the null rows, then run factorize on them.

mask_to_grab_nulls = df['SUBS_GROUP_ID'].isnull()

df[mask_to_grab_nulls]['SUBS_GROUP_ID'] =  pd.factorize(df[mask_to_grab_nulls]['SUBSCRIPTION_NAME'])[0] + 1

This runs, but does not change any values... any ideas on how to solve this?

mikelowry

4 Answers

2

This is likely caused by chained assignment: df[mask] returns a new object, so the assignment never reaches the original DataFrame. The workaround below isn't optimal, but it should work fine in your case:

df2 = df[df['SUBS_GROUP_ID'].isnull()].copy()  # isolate the rows with null IDs (copy avoids SettingWithCopyWarning)
df2['SUBS_GROUP_ID'] = pd.factorize(df2['SUBSCRIPTION_NAME'])[0] + 1  # factorize
df = df.dropna(subset=['SUBS_GROUP_ID'])  # drop only the rows whose ID is null
df_fin = pd.concat([df, df2]).sort_index()  # recombine and restore the original row order
1

You can use scikit-learn's LabelEncoder.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df = df.dropna(subset=['SUBS_GROUP_ID'])  # drop rows with null IDs
df_results = le.fit_transform(df['SUBS_GROUP_ID'].values)  # encode the values as integer classes
df_results
hilo
1

I would use numpy.where to factorize only the non-NaN values.

import pandas as pd
import numpy as np

df = pd.DataFrame({'SUBS_GROUP_ID': ['ID-001', 'ID-002', np.nan, 'ID-004', 'ID-005'],
                   'SUBSCRIPTION_NAME': ['Toyota', 'Safeway', 'Starbucks', 'Safeway', 'Toyota']})
                   
df['SUBS_GROUP_ID'] = np.where(~df['SUBS_GROUP_ID'].isnull(), pd.factorize(df['SUBSCRIPTION_NAME'])[0] + 1, np.nan)

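>>> print(df)

Rows 0, 1, 3 and 4 get the codes 1.0, 2.0, 2.0 and 1.0; row 2, whose SUBS_GROUP_ID was NaN, stays NaN (the NaN makes the column float).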
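If the goal is the reverse, as in the question (keep the existing IDs and fill only the missing ones), the same np.where call can be flipped. This is a minimal sketch on the sample frame from this answer; mixing string IDs with integer codes produces an object column, which is an artifact of the sample data rather than a recommendation:

```python
import numpy as np
import pandas as pd

# Same sample data as in the answer above
df = pd.DataFrame({
    "SUBS_GROUP_ID": ["ID-001", "ID-002", np.nan, "ID-004", "ID-005"],
    "SUBSCRIPTION_NAME": ["Toyota", "Safeway", "Starbucks", "Safeway", "Toyota"],
})

# Codes for every name, in order of first appearance, starting at 1
codes = pd.factorize(df["SUBSCRIPTION_NAME"])[0] + 1

# Where the ID is null take the code, otherwise keep the existing ID
df["SUBS_GROUP_ID"] = np.where(df["SUBS_GROUP_ID"].isnull(), codes, df["SUBS_GROUP_ID"])

print(df)
```

Only row 2 changes here; every row that already had an ID keeps it.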

Timeless
1

What you are doing is called chained indexing, which has two major downsides and should be avoided:

  1. It can be slower than the alternative, because it involves more function calls.
  2. The result is unpredictable; see "Why does assignment fail when using chained indexing?" in the pandas documentation.
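To make the second point concrete, here is a small sketch with made-up sample data (the values are hypothetical; the column names come from the question). It repeats the chained assignment and then counts how many IDs are still null. The warning is silenced only to keep the demo output clean:

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical sample data: one row has an ID, two do not
df = pd.DataFrame({
    "SUBS_GROUP_ID": [10.0, np.nan, np.nan],
    "SUBSCRIPTION_NAME": ["Toyota", "Safeway", "Starbucks"],
})
mask = df["SUBS_GROUP_ID"].isnull()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the chained-assignment warning for this demo
    # df[mask] builds a new DataFrame, so the assignment lands on a temporary object
    df[mask]["SUBS_GROUP_ID"] = pd.factorize(df[mask]["SUBSCRIPTION_NAME"])[0] + 1

# The original frame still has both nulls
still_null = int(df["SUBS_GROUP_ID"].isnull().sum())
print(still_null)
```

Boolean-mask selection returns a copy, so the original df is untouched after the assignment.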

I'm a bit surprised you haven't seen a SettingWithCopyWarning. The warning points you in the right direction:

... Try using .loc[row_indexer,col_indexer] = value instead

So this should work:

mask_to_grab_nulls = df['SUBS_GROUP_ID'].isnull()
df.loc[mask_to_grab_nulls, 'SUBS_GROUP_ID'] = pd.factorize(
    df.loc[mask_to_grab_nulls, 'SUBSCRIPTION_NAME']
)[0] + 1
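For a version that can be run end to end, here is the same .loc assignment on a small frame. The sample values are hypothetical and only the column names come from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two rows already have an ID, three do not
df = pd.DataFrame({
    "SUBS_GROUP_ID": [10.0, 20.0, np.nan, np.nan, np.nan],
    "SUBSCRIPTION_NAME": ["Toyota", "Safeway", "Starbucks", "Costco", "Starbucks"],
})

mask_to_grab_nulls = df["SUBS_GROUP_ID"].isnull()

# A single .loc call selects rows and column together, so the write hits df itself
df.loc[mask_to_grab_nulls, "SUBS_GROUP_ID"] = pd.factorize(
    df.loc[mask_to_grab_nulls, "SUBSCRIPTION_NAME"]
)[0] + 1

print(df)
```

The codes restart at 1 within the null subset, so on real data they could coincide with IDs that already exist. If that matters, adding an offset such as the current maximum ID would be one way around it.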
Timus