0

I want to make a For Loop given below, faster in python.

import pandas as pd
import numpy as np
import scipy

np.random.seed(1)
xl = pd.DataFrame({'Concat' : np.arange(101,999), 'ships_x' : np.random.randint(1001,3000,size=898)})
yl = pd.DataFrame({'PickDate' : np.random.randint(1,8,size=10000),'Concat' : np.random.randint(101,999,size=10000), 'ships_x' : np.random.randint(101,300,size=10000), 'ships_y' : np.random.randint(1001,3000,size=10000)})
tempno = [np.random.randint(1,100,size=5)]
k=1
p = pd.DataFrame(0,index=np.arange(len(xl)),columns=['temp','cv']).astype(object)

for ib in [xb for xb in range(0,len(xl))]:
                tempno1 = np.append(tempno,ib)
                temp = list(set(tempno1))
                temptab = yl[yl['Concat'].isin(np.array(xl['Concat'][tempno1]))].groupby('PickDate')['ships_x','ships_y'].sum().reset_index()
                temptab['contri'] = temptab['ships_x']/temptab['ships_y']
                p.ix[k-1,'cv'] = 1 if math.isnan(scipy.stats.variation(temptab['contri'])) else scipy.stats.variation(temptab['contri'])
                p.ix[k-1,'temp'] = temp
                k = k+1

where,

xl, yl - two data frames I am working on with columns like Concat, x_ships and y_ships.

tempno - a initial list of indices of xl dataframe, referring to a list of 'Concat' values.

So, in for loop we add one extra index to tempno in each iteration and then subset 'yl' dataframe based on 'Concat' values matching with those of 'xl' dataframe. Then, we find "coefficient of variation"(taken from scipy lib) and make note in new dataframe 'p'.

The problem is it is taking too much time as number of iterations of for loop varies in thousands. The 'group_by' line is taking maximum time. I have tried and made a few changes, now the code look likes below, changes made mentioned in comments. There is a slight improvement but this doesn't solve my purpose. Please suggest the fastest way possible to implement this. Many thanks.

# Getting all tempno1 into a list with one step
tempno1 = [np.append(tempno,ib) for ib in [xb for xb in range(0,len(xl))]]
temp = [list(set(tempk)) for tempk in tempno1]
# Taking only needed columns from x and y dfs
xtemp = xl[['Concat']]
ytemp = yl[['Concat','ships_x','ships_y','PickDate']]
#Shortlisting y df and groupby in two diff steps
ytemp = [ytemp[ytemp['Concat'].isin(np.array(xtemp['Concat'][tempnokk]))] for tempnokk in tempno1]
temptab = [ytempk.groupby('PickDate')['ships_x','ships_y'].sum().reset_index() for ytempk in ytemp]
tempkcontri = [tempk['ships_x']/tempk['ships_y'] for tempk in temptab]
tempkcontri = [pd.DataFrame(tempkcontri[i],columns=['contri']) for i in range(0,len(tempkcontri))]
temptab = [temptab[i].join(tempkcontri[i]) for i in range(0,len(temptab))]
pcv = [1 if math.isnan(scipy.stats.variation(temptabkk['contri'])) else scipy.stats.variation(temptabkk['contri']) for temptabkk in temptab]
p = pd.DataFrame({'temp' : temp,'cv': pcv})
  • 1
    In your for loop, why do you have `for ib in [xb for xb in range(0,len(x))]` rather than `for ib in range(0,len(x))`? – jambrothers Mar 06 '17 at 17:17
  • 1
    1) you almost never need loops with pandas; 2) this question is impossible to troubleshoot without and data. So read this and edit your questions to be reproducible: http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Paul H Mar 06 '17 at 17:30
  • @jambrothers: The actual query is `for ib in [xb for xb in range(0,len(x)) if xb not in myconcat(groups[rs]['temp']) and xb not in tempno]]` , hence couldn't use `for ib in range(0,len(x))` Can I make it better ? – Akhilesh chander Mar 07 '17 at 05:46
  • @Akhileshchander it's a bit hard to read, and like Paul said not really possible to reproduce your experience, I can't really comment beyond that I'm afraid. – jambrothers Mar 07 '17 at 10:43
  • @PaulH: I have edited as recommended. Please help. – Akhilesh chander Mar 07 '17 at 14:10
  • so what do you want the output to look like, given those input dataframes? – Paul H Mar 07 '17 at 16:50
  • The output is a dataframe p, which has columns of temp and cv. I need alternate suggestions for executing the code, much faster. – Akhilesh chander Mar 08 '17 at 11:49

0 Answers0