As mentioned in the comments, you should try to avoid iterrows and think about this in terms of a matrix problem. My first step would be to compute the sum of "col1" and "col2" for each DataFrame separately:
df1["sum_col"] = df1["col1"] + df1["col2"]
df2["sum_col"] = df2["col1"] + df2["col2"]
These can then be added together with a bit of numpy magic to get every possible pairwise sum of the two sum_col columns:
all_values = (df1["sum_col"].values[np.newaxis].T +
df2["sum_col"].values[np.newaxis])
all_values now has shape (1000000, 1000): one row for each row of df1, one column for each row of df2, holding every possible sum of the two sum_col values.
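In case the broadcasting trick is unclear, here is the same idea on a couple of tiny made-up arrays (the numbers are purely illustrative):

import numpy as np
x = np.array([1, 2, 3])    # plays the role of df1["sum_col"].values, shape (3,)
y = np.array([10, 20])     # plays the role of df2["sum_col"].values, shape (2,)
outer = x[np.newaxis].T + y[np.newaxis]   # (3, 1) + (1, 2) broadcasts to (3, 2)
# outer == [[11, 21],
#           [12, 22],
#           [13, 23]]

Each row holds one value of x added to every value of y, which is exactly the structure we want for df1 against df2.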
Now, the next part is where I'm not so clear on what you are trying to do, so correct me if I'm wrong. It looks to me like you are setting savVal to the value computed from the first row of df2 on each pass of the outer loop(?). In that case it should have a shape of (1000000,), so we can do
sav_val = all_values[:, 0]
Then we want to find the first(?) value of your inner loop that is less than or equal to 1 and less than sav_val. Let's check whether each of these conditions is met separately:
less_than_one = np.less_equal(all_values, 1)
and
less_than_sav_val = np.less(all_values.T, sav_val).T
The .Ts are transposes, which let the (1000000, 1000) array broadcast correctly against the (1000000,) sav_val vector.
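To see why the transposes are needed, here is a small made-up example: numpy lines a 1-D vector up with the last axis of a 2-D array, so to compare row-wise we flip the array, compare, and flip back.

a = np.array([[1, 2],
              [3, 4],
              [5, 6]])
v = np.array([2, 4, 6])          # one threshold per row of a
per_row = np.less(a.T, v).T      # compare along the last axis of a.T, then transpose back
# per_row == [[ True, False],
#             [ True, False],
#             [ True, False]]

(Equivalently, np.less(a, v[:, np.newaxis]) gives the same result without the transposes.)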
We can combine our two conditions and find the first True value in each row using argmax (see this question). If a row contains no True value, argmax simply returns the first index (0), which we will deal with afterwards.
passes_condition = less_than_one & less_than_sav_val
result = df2['OLDVALUE'].values.take(passes_condition.argmax(axis=1))
OK, almost there. result has shape (1000000,). We now need to fix up the rows where no entry had a value <= 1 and < the first iteration's value. We'll set those entries to -999 for now.
result[~passes_condition.any(axis=1)] = -999
result still has shape (1000000,), one entry per row of df1.
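The argmax-plus-any pattern may be easier to see on a tiny made-up mask and value array:

mask = np.array([[False, True,  False],
                 [False, False, False],
                 [True,  False, True]])
values = np.array([100, 200, 300])          # plays the role of df2['OLDVALUE'].values
picked = values.take(mask.argmax(axis=1))   # first True per row; all-False rows give index 0
picked[~mask.any(axis=1)] = -999            # overwrite the rows that had no True at all
# picked == [200, -999, 100]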
Putting it all together
def rajat_func(df1, df2):
    list_values = []
    for idx, xitems in df1.iterrows():
        savVal = -1
        i = 99
        for idy, yitems in df2.iterrows():
            value = xitems['col1'] + xitems['col2'] + yitems['col1'] + yitems['col2']
            # it only runs for the first time to store the value into savVal
            if savVal == -1:
                savVal = value
            else:
                if value <= 1 and value < savVal:
                    savVal = value
                    i = idy
                    break
        if i == 99:
            # df1.iat[idx, 'NEWVALUE'] = 'LESSTHAN'
            # in case above code throws error then alternative is list
            list_values.append(-999)
        else:
            # df1.iat[idx, 'NEWVALUE'] = df2.loc[i, 'OLDVALUE']
            list_values.append(df2.loc[i, 'OLDVALUE'])
    return list_values
def new_func(df1, df2):
    x = (df1["col1"] + df1["col2"]).values
    y = (df2["col1"] + df2["col2"]).values
    all_values = (x[np.newaxis].T + y[np.newaxis])
    sav_val = all_values[:, 0]
    less_than_one = np.less_equal(all_values, 1)
    less_than_sav_val = np.less(all_values.T, sav_val).T
    passes_condition = less_than_one & less_than_sav_val
    result = df2['OLDVALUE'].values.take(passes_condition.argmax(axis=1))
    result[~passes_condition.any(axis=1)] = -999
    return result
Testing with a df1 of 1000 rows and a df2 of 100 rows, all(new_func(df1, df2) == rajat_func(df1, df2)) is True.
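The exact test data isn't shown here; for checking the two functions against each other, any made-up frames with the right columns will do, for example:

import numpy as np
import pandas as pd
np.random.seed(0)
df1 = pd.DataFrame({"col1": np.random.rand(1000),
                    "col2": np.random.rand(1000)})
df2 = pd.DataFrame({"col1": np.random.rand(100),
                    "col2": np.random.rand(100),
                    "OLDVALUE": np.arange(100)})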
%timeit rajat_func(df1, df2) gives
5.07 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and %timeit new_func(df1, df2) gives
601 µs ± 17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So quite an improvement! Running %time on new_func using a df1 with 1,000,000 rows and a df2 with 1000 rows gives
CPU times: user 4.9 s, sys: 3.05 s, total: 7.96 s
Wall time: 7.99 s
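One caveat at that size: the intermediate (1000000, 1000) float array alone is on the order of 8 GB. If that becomes a problem, the same computation can be run over df1 in blocks. The following is just an untested sketch along the lines of new_func, with an arbitrary chunk_size:

def new_func_chunked(df1, df2, chunk_size=100_000):
    # Same logic as new_func, but never materialises the full (len(df1), len(df2)) array.
    x = (df1["col1"] + df1["col2"]).values
    y = (df2["col1"] + df2["col2"]).values
    old_values = df2["OLDVALUE"].values
    out = np.empty(len(x), dtype=old_values.dtype)
    for start in range(0, len(x), chunk_size):
        block = x[start:start + chunk_size, np.newaxis] + y[np.newaxis]
        passes = (block <= 1) & (block < block[:, :1])   # block[:, :1] is the per-row sav_val
        res = old_values.take(passes.argmax(axis=1))
        res[~passes.any(axis=1)] = -999
        out[start:start + chunk_size] = res
    return out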
Does this solve your problem, or have I completely misunderstood what you are trying to do?