I want to draw more attention to a portion of @michele-piccolini's answer.
I strongly believe that .assign is the best solution here. In the real world, these operations rarely happen in isolation; they are part of a chain of operations. If you want to support such a chain, you should probably use the .assign method.
Here is an example using snowfall data at a ski resort (the same principles would apply to, say, financial data).
The code below reads like a recipe of steps. Both plain assignment (with =) and .insert make this much harder, because they cannot be used in the middle of a chain:
import pandas as pd

raw = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/alta-noaa-1980-2019.csv',
                  parse_dates=['DATE'])

def clean_alta(df):
    return (df
        # keep only the columns of interest
        .loc[:, ['STATION', 'NAME', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'DATE',
                 'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS']]
        # roll the daily rows up to weekly rows
        .groupby(pd.Grouper(key='DATE', freq='W'))
        .agg({'PRCP': 'sum', 'TMAX': 'max', 'TMIN': 'min', 'SNOW': 'sum', 'SNWD': 'mean'})
        # add a constant column and a derived column without leaving the chain
        .assign(LOCATION='Alta',
                T_RANGE=lambda w_df: w_df.TMAX-w_df.TMIN)
    )

clean_alta(raw)
Notice the line .assign(LOCATION='Alta', ...), which creates a column holding a single value right in the middle of the rest of the operations.
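
To make the contrast concrete, here is a minimal sketch of the same kind of step written both ways. It uses a tiny made-up frame instead of the Alta CSV so it runs without a download; the data values and the function names weekly_chained and weekly_unchained are my own illustrations, not part of the original example.

```python
import pandas as pd

# Tiny made-up daily frame standing in for the Alta CSV (illustrative values only).
daily = pd.DataFrame({
    'DATE': pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-08', '2019-01-09']),
    'TMAX': [30, 34, 28, 25],
    'TMIN': [10, 12, 5, 8],
    'SNOW': [2.0, 0.0, 5.0, 1.0],
})

def weekly_chained(df):
    # One expression: aggregate, then add both columns with .assign.
    return (df
        .groupby(pd.Grouper(key='DATE', freq='W'))
        .agg({'TMAX': 'max', 'TMIN': 'min', 'SNOW': 'sum'})
        .assign(LOCATION='Alta',
                T_RANGE=lambda w_df: w_df.TMAX - w_df.TMIN)
    )

def weekly_unchained(df):
    # Same columns, but the chain has to stop so the intermediate frame
    # gets a name that = and .insert can then mutate in place.
    weekly = (df
        .groupby(pd.Grouper(key='DATE', freq='W'))
        .agg({'TMAX': 'max', 'TMIN': 'min', 'SNOW': 'sum'})
    )
    weekly['LOCATION'] = 'Alta'
    weekly.insert(len(weekly.columns), 'T_RANGE', weekly.TMAX - weekly.TMIN)
    return weekly

chained = weekly_chained(daily)
unchained = weekly_unchained(daily)
print(chained)
print(unchained)
```

Both functions should produce the same columns, but the second one needs a temporary variable and two separate statements. Any step that came after the new columns (a filter, a sort, another aggregation) would have to be written as yet another statement rather than one more line in the chain.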