
**Update** I have to generate summary statistics on a large dataframe; example data can be produced as follows:

import numpy as np
import pandas as pd

item_rate_historical_df = pd.DataFrame({'item_id': np.random.choice(a=[0, 1, 2], size=50, p=[0.5, 0.3, 0.2]),
                                        'rating': np.random.rand(50) * 5})

Initially I was running

from datetime import datetime

start = datetime.now()
bb = item_rate_historical_df[['item_id', 'rating']].groupby(['item_id']).agg(
        item_hist_rating_up=pd.NamedAgg(column='rating', aggfunc=lambda x: round(100 * sum(x >= 3.75) / len(x))),
        item_hist_rating_down=pd.NamedAgg(column='rating', aggfunc=lambda x: round(100 * sum(x < 3.75) / len(x))),
        item_hist_rating_q25=pd.NamedAgg(column='rating', aggfunc=lambda x: np.quantile(x, q=0.25)),
        item_hist_rating_q75=pd.NamedAgg(column='rating', aggfunc=lambda x: np.quantile(x, q=0.75)),
        item_hist_rating_min=pd.NamedAgg(column='rating', aggfunc='min'),
        item_hist_rating_count=pd.NamedAgg(column='rating', aggfunc='count'),
        item_hist_rating_max=pd.NamedAgg(column='rating', aggfunc='max'),
        item_hist_rating_avg=pd.NamedAgg(column='rating', aggfunc=np.mean),
    ).reset_index().round(2)
datetime.now() - start

which is nicely readable but does not seem efficient when the data is large (50 GB). That is why I was looking for an approach that leverages TensorFlow and the GPU.

@ALollz kindly recommended using vectorized functions from pandas, perhaps as follows:

start = datetime.now()
aa = pd.concat([item_rate_historical_df['rating'].ge(3.75).groupby(item_rate_historical_df['item_id']).mean(),
                item_rate_historical_df['rating'].lt(3.75).groupby(item_rate_historical_df['item_id']).mean(),
                item_rate_historical_df.groupby(['item_id'])['rating'].quantile(0.25),
                item_rate_historical_df.groupby(['item_id'])['rating'].quantile(0.75),
                item_rate_historical_df.groupby('item_id')[['rating']].agg({"rating": ['count', 'min', 'max', 'mean']})],
               axis=1).round(2)

datetime.now() - start

I can see that each individual line runs faster; however, the `pd.concat` is too slow. For example,

`item_rate_historical_df['rating'].lt(3.75).groupby(item_rate_historical_df['item_id']).mean()` is faster than

.agg(
        item_hist_rating_up=pd.NamedAgg(column='rating', aggfunc=lambda x: round(100 * sum(x >= 3.75) / len(x))),
    )

but when I concatenate the pieces with `pd.concat`, the result ends up even slower than the original approach.
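
For reference, this is the kind of single-groupby layout I think the comments are pointing at - an untested sketch on the same item_rate_historical_df, where the intermediate up/down column names are just illustrative and I have not benchmarked it at the 50 GB scale:

# Untested sketch: compute the >= 3.75 / < 3.75 flags once as plain boolean columns,
# then reuse one groupby object so every statistic is a cython-backed aggregation
# (no lambdas, no repeated groupby calls, one final DataFrame instead of a wide concat).
tmp = item_rate_historical_df.assign(up=item_rate_historical_df['rating'].ge(3.75),
                                     down=item_rate_historical_df['rating'].lt(3.75))
g = tmp.groupby('item_id')

cc = pd.DataFrame({
    'item_hist_rating_up': (100 * g['up'].mean()).round(),
    'item_hist_rating_down': (100 * g['down'].mean()).round(),
    'item_hist_rating_q25': g['rating'].quantile(0.25),
    'item_hist_rating_q75': g['rating'].quantile(0.75),
    'item_hist_rating_min': g['rating'].min(),
    'item_hist_rating_count': g['rating'].count(),
    'item_hist_rating_max': g['rating'].max(),
    'item_hist_rating_avg': g['rating'].mean(),
}).reset_index().round(2)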

**OLD** I have a groupby command like

df_historical_avg_rate = user_historical_df[['item_id', 'rating']].groupby(['item_id']).agg(
        item_hist_rating_up=pd.NamedAgg(column='rating', aggfunc=lambda x: round(100 * sum(x >= 3.75) / len(x))),
        item_hist_rating_down=pd.NamedAgg(column='rating', aggfunc=lambda x: round(100 * sum(x < 3.75) / len(x))),
        item_hist_rating_q25=pd.NamedAgg(column='rating', aggfunc=lambda x: np.quantile(x, q=0.25)),
        item_hist_rating_q75=pd.NamedAgg(column='rating', aggfunc=lambda x: np.quantile(x, q=0.75)),
        item_hist_rating_min=pd.NamedAgg(column='rating', aggfunc='min'),
        item_hist_rating_count=pd.NamedAgg(column='rating', aggfunc='count'),
        item_hist_rating_max=pd.NamedAgg(column='rating', aggfunc='max'),
        item_hist_rating_avg=pd.NamedAgg(column='rating', aggfunc=np.mean),
    ).reset_index().round(2)

This is too slow for large data (50 GB). One has to split the data and leverage multiple processors, which still ends up being slow (a few days). I know Rust could help here, but I am keen to have a Python solution. I wonder if I can leverage TensorFlow? I have seen `tf.experimental.group_by_reducer` from here but couldn't find similar examples.
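
To make "split" concrete, the partial-aggregation pattern I have in mind looks roughly like the sketch below (ratings.csv and the chunk size are only placeholders, and each chunk could equally be handed to a separate process). Count, sum, min and max merge exactly across chunks, but quantiles do not, which is one reason the whole process stays slow:

# Rough sketch of the split-and-combine idea; 'ratings.csv' and chunksize are hypothetical.
# count/sum/min/max/n_up combine exactly across chunks; quantiles would still need the
# full per-item data or an approximation.
import pandas as pd

partials = []
for chunk in pd.read_csv('ratings.csv', usecols=['item_id', 'rating'], chunksize=5_000_000):
    g = chunk.groupby('item_id')['rating']
    partials.append(pd.DataFrame({'count': g.count(), 'sum': g.sum(), 'min': g.min(), 'max': g.max(),
                                  'n_up': chunk['rating'].ge(3.75).groupby(chunk['item_id']).sum()}))

# Combine the per-chunk partial results into the final per-item statistics.
combined = pd.concat(partials).groupby(level=0).agg(
    {'count': 'sum', 'sum': 'sum', 'min': 'min', 'max': 'max', 'n_up': 'sum'})
combined['item_hist_rating_avg'] = (combined['sum'] / combined['count']).round(2)
combined['item_hist_rating_up'] = (100 * combined['n_up'] / combined['count']).round()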

Areza
  • Remove those first 4 aggs with a lambda and see how long it takes. I'd be surprised if those took more than tens of minutes (even with 50gig of data). Your lambdas are slow loops over the groups, and with 50gig of data you could be doing billions of loops. Quantile has a DataFrameGroupBy implementation, and your first operation can be re-written in terms of vectorized operations. You don't need TensorFlow; you just need minimal edits to your code to leverage vectorization and the cython implementations of your operations – ALollz Mar 29 '21 at 21:18
  • Oh! Thanks - what should I do with the question and your answer? How about if I change the title and ask how to optimize ...? – Areza Mar 29 '21 at 21:26
  • Well this is a complicated problem, and one that's hard to fully reproduce, as large data/timings can be difficult to reproduce and evaluate. I don't have the time now to fully answer (and illustrate the timings), but while `NamedAgg`s are succinct they're not great because they default to slow lambdas once you need other args/parameters: https://stackoverflow.com/questions/66068515/how-do-you-change-input-parameters-of-pandas-groupby-agg-function/66068964#66068964. In this case you can use slightly less tidy syntax, but the gain in performance will be immense – ALollz Mar 29 '21 at 21:29
  • For instance, your first calculation looks like you're calculating the mean # of ratings >= 3.75 for each wine_id. If you use something like `df['rating'].ge(3.75).groupby(df['wine_id']).mean()` it's fully vectorized/implemented in cython and will be orders of magnitude faster, taking a 2-day calculation down to several minutes (https://stackoverflow.com/questions/63306888/general-groupby-in-python-pandas-fast-way/63307380#63307380) – ALollz Mar 29 '21 at 21:30
  • I appreciate it - I'll definitely give it a try – Areza Mar 29 '21 at 21:55
  • @ALollz I am doing something wrong with the concat - just updated my question – Areza Mar 30 '21 at 13:03
  • Your question has nothing to do with TensorFlow or GPU; it's simply about speeding up the performance of a pandas calculation on a huge dataframe (50 GB) when it exceeds memory size. First, use chunked reading, and make sure your chunks are < memory size. Second, failing that, try backends like dask and vaex. Third, try alternative frameworks: koalas, Spark, etc. This can be done performantly on CPU. There's no reason to shoehorn it into TensorFlow or GPU; that's a total red herring - please edit that out and just replace it with "Improve performance". Your first task is to make it not blow out physical memory. – smci Mar 30 '21 at 22:09
  • @smci - thanks - you are right that the question has nothing to do with TF. It was originally, but ALollz also mentioned what you said. Basically, I was not sure what to do with the question - I can close and delete it. Regarding reading and splitting the files: those are all in place. The problem is that even with splitting and using multiple processors, the whole process is too slow – Areza Mar 30 '21 at 22:19
  • @smci - I don't understand why you think the question is about reading/loading the data? – Areza Mar 30 '21 at 22:27
  • *"when I glue them together using `concat()` it ends up being even slower"* => **OP is blowing out their physical memory**. Don't do that. Use `.info()` to measure memory usage, use chunking on the concat/append to avoid blowing out memory. We need to see more specifics to give more specifc solution. – smci Mar 30 '21 at 23:22
  • @user702846: whether the blow-out of physical memory happens on `pd.concat`, `append`, or `read_csv`, we agree that's the issue. Chunking is almost surely the solution, not throwing out CPU-based packages. We need to see more specifics of the OP's dataframe and code to give a more specific answer. – smci Mar 30 '21 at 23:24
  • My dear - I have no problem with memory, because each calculation is done on a small subset, as I have explained in the question. Each separate series is faster, but the concat is the slowest part - again, let's not worry about the memory :-) I'll try to clean up the question. – Areza Mar 31 '21 at 07:29

0 Answers