Pandas - Apply a function on a comma-separated column of feature names and store the weights in separate corresponding columns

Question

Consider the following dataframe df, in which the features column is a string of comma-separated feature names in a dataset (df can potentially be large).

index    features
1        'f1'  
2        'f1, f2'
3        'f1, f2, f3'

I also have a function get_weights that accepts a comma-separated string of feature names and returns a list containing a weight for each given feature. The implementation details are not important; for the sake of simplicity, let's assume the function returns equal weights for each feature:

import numpy as np
def get_weights(features):
   features = features.split(', ')
   return np.ones(len(features)) / len(features)

Using pandas, how can I apply get_weights to df and get the results in a new dataframe as below:

index   f1     f2    f3 
1       1      0      0
2       0.5    0.5    0
3       0.33   0.33   0.33

That is, in the resulting dataframe, the features in df.features are turned into columns that contain the weight for that feature per row.

score 1 · Answer 1 · answered Oct 18 '22 at 09:13

You can use:

df2 = (pd.DataFrame([get_weights(s) for s in df['features']], index=df.index)
         .fillna(0).rename(columns=lambda x: f'f{x+1}')
       )
out = df.drop(columns='features').join(df2)

output:

   index        f1        f2        f3
0      1  1.000000  0.000000  0.000000
1      2  0.500000  0.500000  0.000000
2      3  0.333333  0.333333  0.333333
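
One caveat: the rename above labels columns by position, so it assumes every row lists its features in the order f1, f2, f3, ... with none skipped. A minimal sketch of a name-based variant follows, assuming get_weights returns the weights in the same order as the names in the string. The sample data (with a row 'f2, f3') is invented for illustration.

```python
import numpy as np
import pandas as pd

def get_weights(features):
    # Placeholder from the question: equal weight per feature.
    features = features.split(', ')
    return np.ones(len(features)) / len(features)

# Illustrative data (an assumption); the last row does not start at f1.
df = pd.DataFrame({'index': [1, 2, 3],
                   'features': ['f1', 'f1, f2', 'f2, f3']})

# Pair each feature name with its weight, one dict per row.
records = [dict(zip(s.split(', '), get_weights(s))) for s in df['features']]

# Features absent from a row become NaN, then 0; columns are sorted by name.
weights = pd.DataFrame(records, index=df.index).fillna(0).sort_index(axis=1)
out = df.drop(columns='features').join(weights)
print(out)
```

Because the columns come from the names themselves, a row such as 'f2, f3' puts its weights under f2 and f3 rather than under the first two columns.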

Gonçalo Peres · Accepted Answer · 2022-10-18T09:44:57.603

Option 1

Considering that the goal is to apply the function to the features column of the dataframe, one can use pandas.Series.apply as follows

df = df['features'].apply(lambda x: pd.Series(get_weights(x)))

[Out]:

          0         1         2
0  1.000000       NaN       NaN
1  0.500000  0.500000       NaN
2  0.333333  0.333333  0.333333

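However, a few more steps are needed to obtain the desired output. Note that each of the next two snippets extends the previous expression, so it is meant to be run on the original df (which still has the features column), not after the previous assignment.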

First, adjust the previous operation to fill the NaN with 0

df = df['features'].apply(lambda x: pd.Series(get_weights(x))).fillna(0)

[Out]:

          0         1         2
0  1.000000  0.000000  0.000000
1  0.500000  0.500000  0.000000
2  0.333333  0.333333  0.333333

Second, one wants the name of the columns to be, respectively, f1, f2, and f3. For that, one can do the following

df = df['features'].apply(lambda x: pd.Series(get_weights(x))).fillna(0).rename(columns={0: 'f1', 1: 'f2', 2: 'f3'})

[Out]:

         f1        f2        f3
0  1.000000  0.000000  0.000000
1  0.500000  0.500000  0.000000
2  0.333333  0.333333  0.333333

Now, since the result of the previous operation is still missing the index column starting at 1, one can insert it as the first column as follows

df.insert(0, 'index', df.index + 1)

[Out]:

   index        f1        f2        f3
0      1  1.000000  0.000000  0.000000
1      2  0.500000  0.500000  0.000000
2      3  0.333333  0.333333  0.333333

Finally, if the goal is to make the index column the index of the dataframe, one can use pandas.DataFrame.set_index as follows

df = df.set_index('index')

[Out]:

             f1        f2        f3
index                              
1      1.000000  0.000000  0.000000
2      0.500000  0.500000  0.000000
3      0.333333  0.333333  0.333333
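
The steps of Option 1 can also be chained in one expression that leaves the original df untouched. This is only a sketch under the same assumptions as above (features always listed as f1, f2, f3, ... in order); the name weights and the lambda-based rename are choices made here for illustration.

```python
import numpy as np
import pandas as pd

def get_weights(features):
    # Placeholder from the question: equal weight per feature.
    features = features.split(', ')
    return np.ones(len(features)) / len(features)

df = pd.DataFrame({'features': ['f1', 'f1, f2', 'f1, f2, f3']})

weights = (
    df['features']
    .apply(lambda s: pd.Series(get_weights(s)))  # one column per position
    .fillna(0)                                   # shorter rows are padded with 0
    .rename(columns=lambda i: f'f{i + 1}')       # 0, 1, 2 -> f1, f2, f3
)

# Replace the default 0-based index with a 1-based index named 'index'.
weights.index = pd.RangeIndex(1, len(weights) + 1, name='index')
print(weights)
```

Keeping the result in a separate variable means every step can be rerun without having to rebuild df first.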

Option 2

If one doesn't want to use .apply() (as per the first Note below), another option, and a one-liner that satisfies all the requirements, would be to create a new dataframe as follows

df_new = pd.DataFrame([get_weights(x) for x in df['features']]).fillna(0).rename(columns={0: 'f1', 1: 'f2', 2: 'f3'}).set_index(pd.Series(range(1, len(df)+1), name='index'))

[Out]:

             f1        f2        f3
index                              
1      1.000000  0.000000  0.000000
2      0.500000  0.500000  0.000000
3      0.333333  0.333333  0.333333

Notes:

There are strong opinions on using .apply(). I would recommend reading this: When should I (not) want to use pandas apply() in my code?
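
To judge the difference on one's own data, a rough timing sketch such as the following may help. The synthetic 900-row frame and the helper names with_apply and with_list are assumptions made for this example, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def get_weights(features):
    # Placeholder from the question: equal weight per feature.
    features = features.split(', ')
    return np.ones(len(features)) / len(features)

# Synthetic data: 900 rows cycling through the three sample strings.
df = pd.DataFrame({'features': ['f1', 'f1, f2', 'f1, f2, f3'] * 300})

def with_apply():
    # Option 1: builds one pd.Series per row.
    return df['features'].apply(lambda s: pd.Series(get_weights(s))).fillna(0)

def with_list():
    # Option 2: builds the frame once from a list of arrays.
    return pd.DataFrame([get_weights(s) for s in df['features']]).fillna(0)

# Both options should produce the same values.
same_values = np.allclose(with_apply().to_numpy(), with_list().to_numpy())

t_apply = timeit.timeit(with_apply, number=3)
t_list = timeit.timeit(with_list, number=3)
print(f'same values: {same_values}')
print(f'apply: {t_apply:.3f}s, list comprehension: {t_list:.3f}s')
```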

@GonçaloPeres Thanks for the help. Considering that `apply` can be slow, is there any alternative? — MxNx, Oct 18 '22 at 09:35

Ángel De Jaén Gotarredona · Answer 3 · 2022-10-18T10:10:29.757

Using the function get_dummies from pandas you can do:

# 0- Let's define an example pandas DataFrame:

df = pd.DataFrame(
    {
        "features": ["f1", "f1, f2", "f1, f2, f3", "f1, f4"]
    }
)

# 1- Convert column of strings into Series of lists:

aux_series = df["features"].str.split(", ")

# 2- Use the get_dummies function, transpose the result and fill NaNs:

aux_df = pd.concat([pd.get_dummies(aux_series[i]).sum() for i in df.index], axis=1).T.fillna(0)

# 3- Get the 'weight' of each value by dividing by its row sum:

output_df = aux_df.div(aux_df.sum(axis=1), axis=0)

# 4- Print the result:

print(output_df)

[Out]:

         f1        f2        f3   f4
0  1.000000  0.000000  0.000000  0.0
1  0.500000  0.500000  0.000000  0.0
2  0.333333  0.333333  0.333333  0.0
3  0.500000  0.000000  0.000000  0.5
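
Note that the get_dummies route reproduces equal weights by normalizing indicator columns, so it never calls get_weights. If the real function returns non-uniform weights, a sketch along the following lines keeps it in the loop. It assumes get_weights returns weights in the same order as the names in the string and that no feature is repeated within a row (pivot raises on duplicate row/feature pairs); long_df is a name introduced here.

```python
import numpy as np
import pandas as pd

def get_weights(features):
    # Placeholder from the question: equal weight per feature.
    features = features.split(', ')
    return np.ones(len(features)) / len(features)

df = pd.DataFrame({'features': ['f1', 'f1, f2', 'f1, f2, f3', 'f1, f4']})

# One entry per (row, feature) pair; explode repeats the original row label.
exploded = df['features'].str.split(', ').explode()

# Long format: row label, feature name, and the weight from get_weights.
long_df = pd.DataFrame({
    'row': exploded.index,
    'feature': exploded.to_numpy(),
    'weight': np.concatenate([get_weights(s) for s in df['features']]),
})

# Wide format: one column per feature name, 0 where a feature is absent.
output_df = long_df.pivot(index='row', columns='feature', values='weight').fillna(0)
print(output_df)
```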

G.G · Answer 4 · 2023-02-18T03:32:23.340

df1 = pd.DataFrame(
    {
        "features": ["f1", "f1, f2", "f1, f2, f3"]
    }
)
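# The names are separated by ', ' (comma plus space); splitting on ',' alone
# would leave a leading space in ' f2' and ' f3' and change the column order.
df2 = df1.features.str.get_dummies(sep=', ')
df2.mul(df2.sum(1).rdiv(1).round(2), axis=0)

output:

     f1    f2    f3
0  1.00  0.00  0.00
1  0.50  0.50  0.00
2  0.33  0.33  0.33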

edited Feb 18 '23 at 03:32

answered Feb 17 '23 at 07:17

Please double check your output. It does not match the desired output of the question. – General Grievance Feb 17 '23 at 13:06
