Pandas apply function to 100k+ row data frame

Question

I am trying to use Pandas apply row-wise on a data frame with 100k+ rows as follows:

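    import numpy as np

    def get_fruit_vege(x, all_df, f_or_v):
        # Filter all_df down to the rows sharing this row's key
        # (this scan of all_df runs once per row of basket_df)
        y_df = all_df[all_df['key'] == x['key']]
        if f_or_v == "F":
            # Fruit: Group equals the first half of the 'a-b' key
            f_row = y_df[y_df['Group'] == int(y_df.iloc[0]['key'].split('-')[0])]
            return f_row['Number'].iloc[0]
        elif f_or_v == "V":
            # Vegetable: Group equals the second half of the 'a-b' key
            v_row = y_df[y_df['Group'] == int(y_df.iloc[0]['key'].split('-')[1])]
            return v_row['Number'].iloc[0]
        else:
            return np.nan

    basket_df['Number'] = basket_df.apply(lambda x: get_fruit_vege(x, all_df, "F"), axis=1)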

However, the process just hangs when running in the Python console (I'm using PyCharm Community Edition). I am using Pandas apply because I need to cross-reference another data frame using a key that matches row-wise between the two data frames (basket_df and all_df). I'm not sure what I am doing wrong here, or whether I should be using Pandas apply at all. Thanks for your help!

Update: The function does work, but it takes approximately 20 minutes. Is there a better way to go about this?

*cross-reference another data frame using a key that matches row-wise between each data frame*: you should probably be using `merge`, but without a reproducible example I can't say more. — , Mar 14 '22 at 19:31
This lookup is being done each time `y_df = all_df[all_df['key'] == x['key']]`. Does it need to be, or could it be done once? — Andrew, Mar 14 '22 at 19:42
I think it might be possible to shorten the duration of this function to a few seconds, but I need a small sample of your `basket_df` and `all_df` first. — , Mar 14 '22 at 21:38
From my experience: do not use the PyCharm console for such a large chunk of code. PyCharm's introspection is very slow (you may notice that errors which should normally appear quickly only show up after minutes). So I use the PyCharm console only for tests on short dataframes, and otherwise run a program (run by PyCharm) or work outside PyCharm. — Giacomo Catenazzi, Mar 15 '22 at 13:27
Thanks for the suggestions, I actually just did a groupby on ``key`` for ``all_df`` and made lists of the column I needed based on the suggestion here: https://stackoverflow.com/questions/35024023/pandas-groupby-result-into-multiple-columns. Then I just merged the result with ``basket_df`` on ``key``. — KidSudi, Mar 15 '22 at 19:23
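The comments above suggest doing the key lookup once instead of once per row. A minimal sketch of that idea follows, assuming `key` has the form `'a-b'` with integer halves and that `all_df` has `key`, `Group` and `Number` columns, as the question's code implies. The helper name `lookup_number` and the sample data are hypothetical. This is not necessarily what the asker ended up with (they describe a groupby followed by a merge), and unlike the original function it yields NaN rather than raising when a key has no matching row.

```python
import pandas as pd

def lookup_number(basket_df, all_df, f_or_v):
    """Hypothetical vectorized stand-in for the row-wise get_fruit_vege."""
    # "F" uses the first half of the key, "V" the second half.
    part = {"F": 0, "V": 1}.get(f_or_v)
    if part is None:
        # Mirrors the original's fallback of returning NaN.
        return pd.Series(float("nan"), index=basket_df.index)
    # Group id encoded in the chosen half of each key, e.g. "3-4" -> 3 for "F".
    wanted = all_df["key"].str.split("-").str[part].astype(int)
    # Keep rows whose Group matches, then only the first such row per key.
    matches = all_df.loc[all_df["Group"] == wanted].drop_duplicates("key")
    # One lookup table built once; map each basket key to its Number.
    return basket_df["key"].map(matches.set_index("key")["Number"])

# Made-up sample data, for illustration only.
all_df = pd.DataFrame({
    "key": ["1-2", "1-2", "3-4", "3-4"],
    "Group": [1, 2, 3, 4],
    "Number": [10, 20, 30, 40],
})
basket_df = pd.DataFrame({"key": ["3-4", "1-2", "1-2"]})

basket_df["Number"] = lookup_number(basket_df, all_df, "F")
```

Because the filtering of `all_df` happens once rather than for every row of `basket_df`, the work no longer scales with the product of the two frame sizes. The same lookup table could also be joined with `basket_df.merge(..., on="key", how="left")` instead of `map`.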

0 Answers