
I want to import a file and create two additional columns right after importing it.

The file I am importing has the following structure:

index probability_model
1 0.34
2 0.03
3 0.14
4 0.23

The following code works, but I'm trying to avoid it:

df = pd.read_csv(filename)
df['subgroups'] = df['probability_model'].transform(lambda x: pd.qcut(x, 100, duplicates='drop',labels=range(1,101)))
df['groups'] = df['subgroups'].apply(lambda x: 'high' if x>100 else 'medium' if 100>=x>50 else 'low' )

What I would like to do is something like the following. The first assign works well but the second throws an error.

df = pd.read_csv(filename)\
.assign(subgroups = lambda x: pd.qcut(x.probability_model, 100, duplicates='drop',labels=range(1,101)))\
.assign(groups = subgroups.apply(lambda x: 'high' if x>100 else 'medium' if 100>=x>50 else 'low'))
Javier Monsalve
    Please [stop using `apply`](https://stackoverflow.com/questions/54432583/when-should-i-not-want-to-use-pandas-apply-in-my-code)... – Quang Hoang Apr 07 '21 at 16:50
    It gets more complicated when you refuse to use this: https://stackoverflow.com/questions/66967545/assign-a-new-column-in-pandas-in-a-similar-way-as-in-pyspark Please try to keep your code simple, or you can end up using a lot of lambdas, which will make your code *not so readable*. Different languages have different ways of writing efficient code; `assign` does not carry over the columns you have just created on the copy of the DataFrame the way Spark does with `withColumn`, for example. – anky Apr 07 '21 at 16:58

1 Answer


The problem is that the second assign call refers to the subgroups column, which does not exist in df yet at that point.

You first need to assign the subgroups column to df:

df = pd.read_csv(filename)\
.assign(subgroups = lambda x: pd.qcut(x.probability_model, 100, duplicates='drop',labels=range(1,101)))

Now, you can use assign again for the groups column.
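If you would rather keep everything in one chain, one option is to pass a lambda to the second assign as well. The lambda receives the intermediate DataFrame, so subgroups already exists when it runs. The sketch below is only an illustration: `raw` is a made-up stand-in for `pd.read_csv(filename)`, and the labels are cast to int so the comparisons work on plain numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_csv(filename): 1000 distinct probabilities.
raw = pd.DataFrame({"probability_model": np.linspace(0.001, 0.999, 1000)})

df = (
    raw
    # First assign: quantile labels 1..100, cast from categorical to int.
    .assign(
        subgroups=lambda d: pd.qcut(
            d["probability_model"], 100, duplicates="drop", labels=list(range(1, 101))
        ).astype(int)
    )
    # Second assign: the lambda sees the frame that already has `subgroups`.
    # Thresholds are copied from the question; with labels 1..100 the
    # 'high' branch (x > 100) can never match.
    .assign(
        groups=lambda d: d["subgroups"].apply(
            lambda x: "high" if x > 100 else "medium" if 100 >= x > 50 else "low"
        )
    )
)

print(df.head())
```

Note that `duplicates="drop"` can reduce the number of bins when the data has repeated quantile edges, in which case a fixed list of 100 labels no longer matches and `qcut` raises a ValueError.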

Take the MRE below as an example:

In [1648]: df
Out[1648]: 
   Balances  Weight
0        10       7
1        11      15
2        12      30
3        13      20
4        10      15
5        13      20

In [1646]: df.assign(a=df.Balances + df.Weight).assign(b=df.a+df.Weight)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1646-86bddf31de6d> in <module>
----> 1 df.assign(a=df.Balances + df.Weight).assign(b=df.a+df.Weight)

~/Library/Python/3.8/lib/python/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5463             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5464                 return self[name]
-> 5465             return object.__getattribute__(self, name)
   5466 
   5467     def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'a'
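The call above fails because `df.a` is evaluated against the original df before the chain runs. As a minimal sketch of one way around it, using the same data as the MRE, the second assign can take a lambda so that it looks up `a` on the intermediate result instead (the name `out` is just for illustration):

```python
import pandas as pd

# Same data as the MRE above.
df = pd.DataFrame(
    {"Balances": [10, 11, 12, 13, 10, 13], "Weight": [7, 15, 30, 20, 15, 20]}
)

out = (
    df
    .assign(a=lambda d: d.Balances + d.Weight)
    # `d` is the frame returned by the first assign, so `d.a` exists here.
    .assign(b=lambda d: d.a + d.Weight)
)

print(out)
```

The original df is left unchanged, since assign returns a new DataFrame each time.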
Mayank Porwal