
I need help reducing the time taken by the following DataFrame code. It takes around 20 seconds to complete on a dataset of around 2000 records.

import itertools

import numpy as np
import pandas as pd


def findRe(leaddatadf, keyAttributes, datadf):
    atrList = keyAttributes
    reldf = []
    # For each combination of all-but-one key attributes...
    for combs in itertools.combinations(atrList, len(atrList) - 1):
        v_by = set(atrList) - set(combs)  # the one varying attribute

        grpdatapf = datadf.groupby(list(combs))
        for name, group in grpdatapf:
            if group.shape[0] > 1:
                tmpgdf = leaddatadf[leaddatadf['unique_id'].astype(float)
                                    .isin(group['unique_id'].astype(float))]
                if tmpgdf.shape[0] > 1:
                    tmpgdf = tmpgdf.copy()
                    tmpgdf['mprice'] = tmpgdf['mprice'].astype(float)
                    # DataFrame.sort() was removed in pandas 0.20; use sort_values()
                    tmpgdf = tmpgdf.sort_values('mprice')

                    tmpgdf['desc'] = tmpgdf['description']
                    # Pair each row with the next-cheapest row in the group
                    tmpgdf['related_id'] = tmpgdf['id'].shift(-1)
                    tmpgdf['related_desc'] = tmpgdf['description'].shift(-1)
                    tmpgdf['related_mprice'] = tmpgdf['mprice'].shift(-1)

                    # Absolute price difference to the related row
                    tmpgdf['pld'] = (tmpgdf['related_mprice'] - tmpgdf['mprice']).abs()
                    tmpgdf['pltxt'] = np.where(
                        tmpgdf['related_mprice'] > tmpgdf['mprice'], '<',
                        np.where(tmpgdf['related_mprice'] < tmpgdf['mprice'], '>', '='))
                    tmpgdf['prc_rlt_dif_nbr_p'] = (tmpgdf['pld'] / tmpgdf['mprice']).abs()
                    tmpgdf['keyatr'] = str(atrList)
                    tmpgdf['varying'] = "".join(v_by)

                    temp = tmpgdf[['id', 'desc', 'related_id',
                                   'related_desc', 'pltxt', 'pld',
                                   'prc_rlt_dif_nbr_p', 'mprice', 'related_mprice',
                                   'keyatr', 'varying']]
                    temp = temp[temp['related_mprice'] >= 0.0]
                    reldf.extend(temp.to_dict('records'))
    return pd.DataFrame(reldf, columns=['id', 'desc', 'related_id',
                                        'related_desc', 'pltxt', 'pld',
                                        'prc_rlt_dif_nbr_p', 'mprice',
                                        'related_mprice', 'keyatr', 'varying'])
EdChum

2 Answers


Please print, after every line, how many milliseconds it takes.

Use the approach from this answer: https://stackoverflow.com/a/1557584/2655092

Then come back with the lines that take the most time.
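A minimal sketch of that per-line timing idea, using a hypothetical `timed()` helper; the labels and wrapped statements are stand-ins for the real ones in your function:

```python
import time


def timed(label, func, *args, **kwargs):
    """Run func, print how many ms it took, and return its result."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"Time taken by {label} - {elapsed_ms:.3f} ms")
    return result


# Usage: wrap each suspect statement so every line reports its cost.
values = timed("building the list", list, range(1_000_000))
total = timed("summing", sum, values)
```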

whoopdedoo
  • Time taken by getting tmpdf - 0.00083160400390625 - Time taken by reset_index - 0.0006613731384277344 - Time taken by mprice to float - 0.0002810955047607422 - Time taken by time taken by sort by mprice - 0.0007467269897460938 - Time taken by id - 0.0015559196472167969 - Time taken by desc - 0.0017049312591552734 - – Narendra Mohan Prasad Oct 06 '16 at 11:23
  • Time taken by related_id - 0.0018208026885986328 - Time taken by related_desc - 0.0018434524536132812 - Time taken by related_mprice - 0.0015764236450195312 - Time taken by pld - 0.0020411014556884766 - Time taken by pltxt - 0.0022830963134765625 - Time taken by prc_rlt_dif_nbr_p - 0.001756429672241211 - Time taken by keyatr - 0.0015103816986083984 - Time taken by varying - 0.00200653076171875 - Time taken by allcomb - 0.0007736682891845703 - Time taken to convert df to list of dict - 0.00047779083251953125 - – Narendra Mohan Prasad Oct 06 '16 at 11:24
  • Hi, I have added the time taken by each line. The key attribute list has 14 entries. – Narendra Mohan Prasad Oct 06 '16 at 11:24

You're calling astype(float) very often, and every call creates a copy of the series. Try setting dtype=float at the very beginning, when you load the dataframe: that way the conversion to float happens once, not on every iteration. :)
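A small sketch of converting once at load time instead of on every iteration; the column names and inline CSV are made up for illustration:

```python
import io

import pandas as pd

csv = io.StringIO("unique_id,mprice\n1,10.5\n2,8.0\n3,12.25\n")

# Convert the numeric columns to float once, while loading...
df = pd.read_csv(csv, dtype={"unique_id": float, "mprice": float})

# ...so later code can filter, sort, and compare directly,
# with no astype(float) copies inside the loop.
cheaper = df[df["mprice"] < 11.0]
print(cheaper["unique_id"].tolist())
```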

Let me know if this helps

Shivam Gaur
  • Thanks sir, I will do that. The other thing I want to improve: I have 14-1 unique combinations, and for each one I am grouping and finding the relations between the current row and the next row. – Narendra Mohan Prasad Oct 06 '16 at 14:08
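For that part, one option worth trying is to drop the per-group Python loop entirely and let pandas compute the next-row relation for all groups at once with a grouped shift(-1). This is only a sketch with made-up data, reusing the column names from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "grp": ["a", "a", "a", "b"],
    "mprice": [10.0, 8.0, 12.0, 5.0],
})

# Sort once so shift(-1) pairs each row with the next-cheapest in its group.
df = df.sort_values(["grp", "mprice"])
g = df.groupby("grp")

# Next row's values within the same group, computed in one vectorized pass.
df["related_id"] = g["id"].shift(-1)
df["related_mprice"] = g["mprice"].shift(-1)
df["pld"] = (df["related_mprice"] - df["mprice"]).abs()
df["pltxt"] = np.where(df["related_mprice"] > df["mprice"], "<",
              np.where(df["related_mprice"] < df["mprice"], ">", "="))

# Last row of each group (and single-row groups) get NaN and drop out here.
result = df[df["related_mprice"].notna()]
```

This replaces the `for name, group in grpdatapf` loop with a handful of column operations, which is usually where the 20-second cost goes on a few thousand rows.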