Update: I have to generate summary statistics on a large dataframe; the data can be produced with
import numpy as np
import pandas as pd
from datetime import datetime

item_rate_historical_df = pd.DataFrame({'item_id': np.random.choice(a=[0, 1, 2], size=50, p=[0.5, 0.3, 0.2]),
                                        'rating': np.random.rand(50) * 5})
Initially I was running
start=datetime.now()
bb = item_rate_historical_df[['item_id', 'rating']].\
    groupby(['item_id']).agg(
        item_hist_rating_up=pd.NamedAgg(column='rating', aggfunc=lambda x: round(100 * sum(x >= 3.75) / len(x))),
        item_hist_rating_down=pd.NamedAgg(column='rating', aggfunc=lambda x: round(100 * sum(x < 3.75) / len(x))),
        item_hist_rating_q25=pd.NamedAgg(column='rating', aggfunc=lambda x: np.quantile(x, q=0.25)),
        item_hist_rating_q75=pd.NamedAgg(column='rating', aggfunc=lambda x: np.quantile(x, q=0.75)),
        item_hist_rating_min=pd.NamedAgg(column='rating', aggfunc='min'),
        item_hist_rating_count=pd.NamedAgg(column='rating', aggfunc='count'),
        item_hist_rating_max=pd.NamedAgg(column='rating', aggfunc='max'),
        item_hist_rating_avg=pd.NamedAgg(column='rating', aggfunc=np.mean),
    ).reset_index().round(2)
datetime.now()-start
which is nicely readable but not efficient when the data is large (50 GB). That is why I was seeking an approach that leverages TensorFlow and the GPU.
@ALollz kindly recommended using vectorized functions from pandas, perhaps as follows:
start=datetime.now()
aa = pd.concat(
    [item_rate_historical_df['rating'].ge(3.75).groupby(item_rate_historical_df['item_id']).mean(),
     item_rate_historical_df['rating'].lt(3.75).groupby(item_rate_historical_df['item_id']).mean(),
     item_rate_historical_df.groupby(['item_id'])['rating'].quantile(0.25),
     item_rate_historical_df.groupby(['item_id'])['rating'].quantile(0.75),
     item_rate_historical_df.groupby('item_id')[['rating']].agg({"rating": ['count', 'min', 'max', 'mean']})],
    axis=1).round(2)
datetime.now()-start
I can see that each individual line runs faster, however the pd.concat is too slow. For example,
item_rate_historical_df['rating'].lt(3.75).groupby(item_rate_historical_df['item_id']).mean()
is faster than
.agg(
item_hist_rating_up=pd.NamedAgg(column='rating', aggfunc=lambda x: round(100 * sum(x>=3.75) / len(x)))
but when I concatenate them with pd.concat, the whole thing ends up even slower than the original approach.
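One variant I have been considering to avoid both the lambdas and the concat (a sketch only, tested just on the toy frame above, not at 50 GB) is to precompute the >= 3.75 flag once as a column, reuse a single groupby object, and join the already-grouped pieces on the shared item_id index:

# sketch: one groupby object, built-in aggregations only
tmp = item_rate_historical_df.assign(up=item_rate_historical_df['rating'].ge(3.75))
g = tmp.groupby('item_id')

out = g['rating'].agg(item_hist_rating_count='count',
                      item_hist_rating_min='min',
                      item_hist_rating_max='max',
                      item_hist_rating_avg='mean')
out['item_hist_rating_up'] = (100 * g['up'].mean()).round()
out['item_hist_rating_down'] = (100 * (1 - g['up'].mean())).round()

# per-group quantiles come back as a MultiIndex Series; unstack the quantile level into columns
quantiles = g['rating'].quantile([0.25, 0.75]).unstack()
quantiles.columns = ['item_hist_rating_q25', 'item_hist_rating_q75']
out = out.join(quantiles).reset_index().round(2)

The join here is between frames that are already indexed by item_id (one row per item), so it should be far cheaper than concatenating the per-column groupbys, but I have not benchmarked it on the real data.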
OLD: I have a groupby command like
df_historical_avg_rate = user_historical_df[['item_id', 'rating']].\
    groupby(['item_id']).agg(
        item_hist_rating_up=pd.NamedAgg(column='rating', aggfunc=lambda x: round(100 * sum(x >= 3.75) / len(x))),
        item_hist_rating_down=pd.NamedAgg(column='rating', aggfunc=lambda x: round(100 * sum(x < 3.75) / len(x))),
        item_hist_rating_q25=pd.NamedAgg(column='rating', aggfunc=lambda x: np.quantile(x, q=0.25)),
        item_hist_rating_q75=pd.NamedAgg(column='rating', aggfunc=lambda x: np.quantile(x, q=0.75)),
        item_hist_rating_min=pd.NamedAgg(column='rating', aggfunc='min'),
        item_hist_rating_count=pd.NamedAgg(column='rating', aggfunc='count'),
        item_hist_rating_max=pd.NamedAgg(column='rating', aggfunc='max'),
        item_hist_rating_avg=pd.NamedAgg(column='rating', aggfunc=np.mean),
    ).reset_index().round(2)
This is too slow for large data (50 GB). One has to split the data and use multiple processes, which still ends up slow (a few days). I know Rust could help here, but I am keen to have a Python solution. I wonder if I can leverage TensorFlow? I have seen tf.experimental.group_by_reducer from here but couldn't find similar examples.
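The only way I can imagine wiring tf.data.experimental.group_by_reducer for this is a streaming per-item state; below is a completely untested sketch (I am not sure I am using Reducer correctly, and the state layout and output keys are just my guesses). It would cover count, mean and the up/down percentage:

import numpy as np
import tensorflow as tf

# untested sketch: per item_id keep state = [row_count, rating_sum, count_of_ratings >= 3.75]
ds = tf.data.Dataset.from_tensor_slices({
    'item_id': item_rate_historical_df['item_id'].values.astype(np.int64),
    'rating': item_rate_historical_df['rating'].values.astype(np.float32)})

reducer = tf.data.experimental.Reducer(
    init_func=lambda key: tf.zeros(3, dtype=tf.float32),
    reduce_func=lambda state, row: state + tf.stack(
        [1.0, row['rating'], tf.cast(row['rating'] >= 3.75, tf.float32)]),
    finalize_func=lambda state: {'count': state[0],
                                 'avg': state[1] / state[0],
                                 'pct_up': 100.0 * state[2] / state[0]})

per_item = ds.apply(tf.data.experimental.group_by_reducer(
    key_func=lambda row: row['item_id'],   # key is expected to be an int64 scalar
    reducer=reducer))

for stats in per_item:
    print(stats)

min and max could probably be folded into the state the same way, but the quantiles would still need the whole group in memory, which is part of why I am asking whether this API is the right direction at all.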