
I am performing data manipulation in Python using pandas on a very large dataset, say 100 million rows. I have two dataframes and wish to generate a third dataframe as per the conditions described below:

dataframe 1:
Col_B and Col_D are of int64 type

Col_A   Col_B   Col_C   Col_D
 A       11      B       20
 A       11      C       24
 B       14      R       32
...      ...    ...      ...

dataframe 2:
Col_Z is of type float64 and the remaining columns are of type int64

Col_X   Col_Y   Col_P   Col_Q   Col_Z
 10      15      16      21      0.99
 10      15      17      22      0.89
...     ...     ...     ...      ...
...     ...     ...     ...      ...
 11      15      30      35      0.67
...     ...     ...     ...      ...

Condition to be applied: for the sake of understanding the conditions, consider only the first row of each dataframe:

if the value of Col_B is between the values of Col_X and Col_Y, and the value of Col_D is between the values of Col_P and Col_Q, then return the corresponding values of Col_A, Col_C and Col_Z; otherwise return NaN

Expected Output (Dataframe 3):

Col_A   Col_C   Col_Z
 A       B       0.99
NaN     NaN      NaN
 B       R       0.67

Note: This output is generated considering only these three rows in each dataframe, but in actuality each row of Dataframe 1 has to scan all of the rows in Dataframe 2 until the desired condition is met.

My Code:

df3 = {}
Col_A = []
Col_C = []
Col_Z = []
for i in df1.iterrows():
    # Pick the Col_Z value of the df2 row whose ranges contain
    # this row's Col_B and Col_D values
    value = float(df2[(i[1][1] > df2['Col_X'].values) &
                      (i[1][1] < df2['Col_Y'].values) &
                      (i[1][3] > df2['Col_P'].values) &
                      (i[1][3] < df2['Col_Q'].values)]['Col_Z'])

    if bool(value):
        Col_Z.append(value)
        Col_A.append(i[1][0])
        Col_C.append(i[1][2])
    else:
        Col_Z.append(float('NaN'))
        Col_A.append(float('NaN'))
        Col_C.append(float('NaN'))

This code works fine up until the condition is met; as soon as the condition isn't met, it throws a TypeError. Please can anyone rectify this?

Also, I wanted to know if there is any alternate and efficient way to perform it. Please let me know.

Rawss24

2 Answers


New answer:

You have two nested loops. The outer one is always a full loop, but the inner one is not: it can stop at the first match. So the question is how to improve the efficiency of that inner, partial loop.

Here, I give you two ways to perform the second loop:

  • The first considers the dataset as a whole, processing all the data and selecting the matching values.
  • The second iterates over the rows until the condition is matched.

This discussion might give you some insights on how to perform the iterations.

# Import modules
import numpy as np
import pandas as pd

df1 = pd.DataFrame([["A", 11,  "B", 20],
                    ["A", 11,  "C", 24],
                    ["B", 14,  "R", 32]],
                   columns=["Col_A", "Col_B", "Col_C", "Col_D"])
df2 = pd.DataFrame([[10, 15,  16, 21, 0.99],
                    [10, 15,  17, 22, 0.89],
                    [11, 15,  30, 35, 0.67]],
                   columns=["Col_X", "Col_Y", "Col_P", "Col_Q", "Col_Z"])

def getCondition(row, df2):
    # Iterate df2 till a row meets the condition
    for _, row_df2 in df2.iterrows():
        if row_df2.Col_X <= row.Col_B and row.Col_B < row_df2.Col_Y \
            and row_df2.Col_P <= row.Col_D and row.Col_D < row_df2.Col_Q:
            return pd.Series([row.Col_A, row.Col_C, row_df2.Col_Z])
    return np.nan


def getCondition2(row, df2):
    # Find all rows matching the condition and select the first
    condition = ((df2.Col_X <= row.Col_B) & (row.Col_B < df2.Col_Y)\
        & (df2.Col_P <= row.Col_D) & (row.Col_D < df2.Col_Q))
    if condition.any():
        return pd.Series([row.Col_A, row.Col_C, df2.Col_Z[condition].iloc[0]])
    return np.nan


# Apply the condition
output = df1.apply(getCondition2, args=[df2], axis=1)
print(output)
#      0    1     2
# 0    A    B  0.99
# 1  NaN  NaN   NaN
# 2    B    R  0.67
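
If even the apply version is too slow at this scale, a NumPy broadcasting sketch removes the Python-level loop entirely. Note the assumptions: it keeps the first matching row of df2 (as getCondition2 does), and it builds a boolean matrix of shape len(df1) x len(df2), so it is only practical when that matrix fits in memory; chunk df1 otherwise.

def getConditionVectorized(df1, df2):
    # Compare every row of df1 against every row of df2 at once:
    # cond[i, j] is True when row i of df1 falls inside the intervals
    # of row j of df2
    b = df1["Col_B"].to_numpy()[:, None]
    d = df1["Col_D"].to_numpy()[:, None]
    cond = ((df2["Col_X"].to_numpy() <= b) & (b < df2["Col_Y"].to_numpy())
            & (df2["Col_P"].to_numpy() <= d) & (d < df2["Col_Q"].to_numpy()))

    found = cond.any(axis=1)     # does row i of df1 match any row of df2?
    first = cond.argmax(axis=1)  # position of the first matching df2 row

    return pd.DataFrame({
        "Col_A": df1["Col_A"].where(found),
        "Col_C": df1["Col_C"].where(found),
        "Col_Z": np.where(found, df2["Col_Z"].to_numpy()[first], np.nan),
    })

print(getConditionVectorized(df1, df2))
#   Col_A Col_C  Col_Z
# 0     A     B   0.99
# 1   NaN   NaN    NaN
# 2     B     R   0.67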

Old answer:

You can do this by considering the data set as a whole.

  • Firstly, for convenience, I suggest you join your two datasets into one. You can do it with the merge function or just concat. Here, I use concat since another solution uses merge. To be clear about what they do, you can have a look at this.
  • Then, you can define your condition on whole columns. Take care: the and operator becomes &.
  • Finally, you can call the where function, which returns NaN wherever the condition isn't satisfied.
  • To fit the desired output, you can filter the columns using iloc or just by calling the column names.

Here is the code:

# Import module
import pandas as pd

df1 = pd.DataFrame([["A", 11,  "B", 20],
                    ["A", 11,  "C", 24],
                    ["B", 14,  "R", 19]],
                   columns=["Col_A", "Col_B", "Col_C", "Col_D"])
df2 = pd.DataFrame([[10, 15,  16, 21, 0.99],
                    [10, 15,  17, 22, 0.89],
                    [11, 15,  16, 20, 0.67]],
                   columns=["Col_X", "Col_Y", "Col_P", "Col_Q", "Col_Z"])

# Concat the dataframe
df = pd.concat([df1, df2], axis=1)
print(df)

# Define the conditions
condition_col_b = ((df.Col_X <= df.Col_B) & (df.Col_B < df.Col_Y))
condition_col_d = ((df.Col_P <= df.Col_D) & (df.Col_D < df.Col_Q))

print(condition_col_b & condition_col_d)
# 0     True
# 1    False
# 2     True

# Apply the condition
output = df.where(condition_col_b & condition_col_d)
print(output)
#   Col_A  Col_B Col_C  Col_D  Col_X  Col_Y  Col_P  Col_Q  Col_Z
# 0     A   11.0     B   20.0   10.0   15.0   16.0   21.0   0.99
# 1   NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN
# 2     B   14.0     R   19.0   11.0   15.0   16.0   20.0   0.67

# Filter output
print(output[['Col_A', 'Col_C', 'Col_Z']])
#   Col_A Col_C  Col_Z
# 0     A     B   0.99
# 1   NaN   NaN    NaN
# 2     B     R   0.67
Alexandre B.
  • Good to know. Can you update the question with a more relevant example (different numbers of rows)? I am not sure I understand how the condition is computed, can you explain it? – Alexandre B. Jul 18 '19 at 15:51
  • Sorry I accidentally deleted the last comment. I have updated the dataframe 2 table in my explanation. – Rawss24 Jul 18 '19 at 16:12
  • I can't concatenate and apply the condition because, firstly, the sizes of the two dataframes are not equal, and that would generate a huge number of unrequired NaN values. Also, dataframe 2 is like a lookup table: the value from dataframe 1's Col_B or Col_D will not necessarily be at the same index in dataframe 2's columns, hence it needs to scan the whole of dataframe 2 before finalizing which Col_A, Col_C and Col_Z need to be returned. – Rawss24 Jul 18 '19 at 16:13
  • What I meant to say was that the matching row in dataframe 2 can be anywhere, and the value from the dataframe 1 table needs to search for where it is and pull out the corresponding values of Col_A, Col_C, Col_Z. – Rawss24 Jul 18 '19 at 16:15
  • So the question is: how are you joining the two dataframes? – Alexandre B. Jul 18 '19 at 20:51
  • No, I can definitely concatenate the two dataframes. Firstly, it would not make sense to join them; my point is that the two dataframes have unequal numbers of rows, and the value of Col_B, say 11 in this example, should be looked up in dataframe 2's Col_X and Col_Y; if 11 is between the values of Col_X and Col_Y then it should return the values of Col_A, Col_C and Col_Z.
    Note: it is not necessary that both dataframes have the same index.
    – Rawss24 Jul 18 '19 at 21:11
  • *if the 11 is between the value of Col_X and Col_Y* does not make sense. 11 is a number while `Col_X` and `Col_Y` are `list`s. If you want to compare a number with a list, you should either define an aggregation rule (min, max, mean) or look at [`any` and `all`](https://stackoverflow.com/questions/19389490/how-do-pythons-any-and-all-functions-work). – Alexandre B. Jul 18 '19 at 21:16
  • Okay, just consider one row of dataframe 1 and one row of dataframe 2, that is, the first row in each case. When we look at the condition, it says that the value of Col_B should be in between the values present in Col_X and Col_Y. 11 in this case is the value of Col_B, and that is exactly in between 10 and 15, which are the values of Col_X and Col_Y. Hence return the corresponding values of Col_A, Col_C, Col_Z where this condition matched. This was a pretty easy case where everything was in the first row, but imagine if the value of Col_X and Col_Y was not present in the 1st row and instead was present in, say, row 100. – Rawss24 Jul 18 '19 at 21:20
  • So, I want the value of Col_B, that is 11, to look up all the values of dataframe 2's Col_X and Col_Y, until it meets the criteria. – Rawss24 Jul 18 '19 at 21:21
  • So you select the first row that meets the condition? – Alexandre B. Jul 18 '19 at 21:24
  • Yes, but as per my case, there would always be only one such row each time – Rawss24 Jul 18 '19 at 21:25
  • Thank you for your new answer. In it, you have mentioned one of the arguments as df2=df2. I don't think that will work and I believe it will give a `TypeError`. – Rawss24 Jul 19 '19 at 15:49
  • I use `df2=df2` to pass `df2` as the 2nd argument of `getCondition` and `getCondition2`. You're right, the documentation suggests using `args=[...]`. It's done in the update. Thank you! – Alexandre B. Jul 19 '19 at 15:58
  • I tried your new answer with the getCondition2 function; it is working fine with small datasets, thank you for that. It is much smaller in code than my own answer. But unfortunately, it does take a large amount of time, same as mine, as I have to run it over 400 million records. – Rawss24 Jul 19 '19 at 16:42

After a few trials, I was able to fix my own code. Here it is, the rectified one:

df3 = {}
Col_A = []
Col_C = []
Col_Z = []
for i in df1.iterrows():
    # Select the Col_Z values of the df2 rows whose ranges contain
    # this row's Col_B and Col_D values
    value = df2[(i[1][1] > df2['Col_X'].values) &
                (i[1][1] < df2['Col_Y'].values) &
                (i[1][3] > df2['Col_P'].values) &
                (i[1][3] < df2['Col_Q'].values)]['Col_Z']

    if value.empty:
        continue
    else:
        Col_Z.append(value.iloc[0])  # extract the scalar, not the Series
        Col_A.append(i[1][0])
        Col_C.append(i[1][2])

df3['A'] = Col_A
df3['C'] = Col_C
df3['Z'] = Col_Z
df3 = pd.DataFrame(df3)

However, due to the for loop iterating over all the rows, this method may not be efficient for large datasets, say 100-200 million rows. Hoping to find a more efficient way!
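
One incremental speedup, as a sketch (untested at that scale): itertuples avoids building a Series for every row and is usually much faster than iterrows, and pulling the df2 columns out as NumPy arrays once, outside the loop, avoids repeated indexing. For the full dataset, the broadcasting sketch under the first answer, chunked over df1, avoids the Python loop altogether.

import pandas as pd

# Pull the lookup columns out as NumPy arrays once, outside the loop
x = df2['Col_X'].to_numpy()
y = df2['Col_Y'].to_numpy()
p = df2['Col_P'].to_numpy()
q = df2['Col_Q'].to_numpy()
z = df2['Col_Z'].to_numpy()

rows = []
for t in df1.itertuples(index=False):
    # Same strict comparisons as the loop above
    mask = (t.Col_B > x) & (t.Col_B < y) & (t.Col_D > p) & (t.Col_D < q)
    if mask.any():
        rows.append((t.Col_A, t.Col_C, z[mask.argmax()]))

df3 = pd.DataFrame(rows, columns=['A', 'C', 'Z'])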

Rawss24