0

I want to do a left join with a condition (if left.values >= right.lo & left.values < right.hi)

So I wrote the following code, to:

  1. Drop out duplicates of the keys on the left data base
  2. Do the above conditional merge with the right data base
  3. Then merge the resulting database with the initial left data base

So my helper functions are:

import pandas as pd
import numpy as np

data = pd.read_pickle("C:/Quang/base_datalake_net.pkl")

t_BEH_VitMaxi = pd.read_csv("table/VEH_VitMaxi.csv", delimiter=';', decimal=',')
t_VEH_Age = pd.read_csv("table/VEH_Age.csv", delimiter=';', decimal=',')


def left_cond_merge_simple_help(left, right, left_on, right_on_lo, right_on_hi):
    left.reset_index(drop=True, inplace=True)
    right.reset_index(drop=True, inplace=True)
    a = left[left_on].values
    bh = right[right_on_hi].values
    bl = right[right_on_lo].values
    i, j = np.where((a[:, None] >= bl) & (a[:, None] < bh))
    result = pd.concat([left.loc[i].reset_index(drop=True),
                        right.loc[j].reset_index(drop=True)],
                       axis=1).append(
        left[~np.in1d(np.arange(len(left)), np.unique(i))], ignore_index=True)
    return result


def left_cond_merge_simple(left, right, left_on, right_on_lo, right_on_hi):
    temp = pd.DataFrame({left_on: left[left_on].unique()})
    temp = left_cond_merge_simple_help(left=temp, right=right, left_on=left_on,
                                       right_on_lo=right_on_lo, right_on_hi=right_on_hi)
    return left.merge(temp, on=left_on, how='left')

The strange behavior is that:

If I run only this line of code, it takes 4 secs, which is anormally long, because of the step 3, while my left db is only 36000x300 in dimensions, my right db is 20x5 in dimensions, the keys on the right are uniques.

% time data = left_cond_merge_simple(left=data, right=t_VEH_Age, left_on='VEH_Age', 
                                     right_on_lo='lo', right_on_hi='hi')

But if I run that line of code after this one (which takes also 4 secs), it only takes 0.1 sec:

% time data = left_cond_merge_simple(left=data, right=t_BEH_VitMaxi, left_on='VEH_VitMaxi',
                                     right_on_lo='lo', right_on_hi='hi')

What can be the reason?

My right db looks like:

enter image description here

Metariat
  • 522
  • 3
  • 16

0 Answers0