1

I'm new to Pandas but thanks to Add column with constant value to pandas dataframe I was able to add different columns at once with

c = {'new1': 'w', 'new2': 'y', 'new3': 'z'}
df.assign(**c)

However I'm trying to figure out what's the path to take when I want to add a new column to a dataframe (currently 1.2 million rows * 23 columns).

Let's simplify the df a bit and try to make it more clear:

Order   Orderline   Product  
1       0           Laptop  
1       1           Bag  
1       2           Mouse  
2       0           Keyboard  
3       0           Laptop  
3       1           Mouse  

I would like to add a new column where depending if the Order has at least 1 product == Bag then it should be 1 (for all rows for that specific order), otherwise 0.

Result would become:

Order   Orderline   Product   HasBag  
1       0           Laptop    1  
1       1           Bag       1  
1       2           Mouse     1  
2       0           Keyboard  0  
3       0           Laptop    0  
3       1           Mouse     0  

What I could do is find all the unique order numbers, then filter out the subframe, check the Product column for Bag, if found then add 1 to a new column, otherwise 0, and then replace the original subframe with the result.

Likely there's a way better manner to accomplish this, and also way more performant.

The main reason I'm trying to do this, is to flatten things down later on. Every order should become 1 line with some values of product. I don't need the information for Bag anymore but I want to keep in my dataframe if the original order used to have a Bag (1) or no Bag (0).

Ultimately when the data is cleaned out it can be used as a base for scikit-learn (or that's what I hope).

Kris van der Mast
  • 16,343
  • 8
  • 39
  • 61

1 Answers1

2

If I understand you correctly, you want GroupBy.transform.any

First we create a boolean array by checking which rows in Product are Bag with Series.eq. Then we GroupBy on this boolean array and check if any of the values are True. We use transform to keep the shape of our initial array so we can assign the values back.

df['ind'] = df['Product'].eq('Bag').groupby(df['Order']).transform('any').astype(int)

   Order  Orderline   Product  ind
0      1          0    Laptop    1
1      1          1       Bag    1
2      1          2     Mouse    1
3      2          0  Keyboard    0
4      3          0    Laptop    0
5      3          1     Mouse    0
Erfan
  • 40,971
  • 8
  • 66
  • 78
  • Thanks! I needed to make the `astype(float)` as some values are NaN and it didn't take int as type. I still need to clean a lot in the data (and will add quite a bunch of other columns in the same way as you described). – Kris van der Mast Apr 21 '20 at 20:42
  • Hi @Erfan, could you please elaborate a bit more? What's the 'any' for in transform? Or is it the function any on the grouped by dataframe: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.any.html? – Kris van der Mast Apr 22 '20 at 06:42
  • Found an interesting article about this transform functionality: https://pbpython.com/pandas_transform.html. – Kris van der Mast Apr 22 '20 at 07:58