
Referring to the same question asked here: join dataframes for single column. Now I want to extend this to two more columns, for example:

df1:

price_start  price_end  year_start  year_end  score
         10         50        2001      2005     20
         60        100        2001      2005     50
         10         50        2006      2010     30

df2:

Price  year
   10  2001
   70  2002
   50  2010

Now I want to map the score from df1 onto the matching rows of df2.

expected output:

price  year  score
   10  2001     20
   70  2002     50
   50  2010     30
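
For reference, a minimal sketch to reconstruct df1 and df2 as pandas DataFrames from the tables above (values copied from the question):

import pandas as pd

df1 = pd.DataFrame({
    'price_start': [10, 60, 10],
    'price_end':   [50, 100, 50],
    'year_start':  [2001, 2001, 2006],
    'year_end':    [2005, 2005, 2010],
    'score':       [20, 50, 30],
})

df2 = pd.DataFrame({
    'Price': [10, 70, 50],
    'year':  [2001, 2002, 2010],
})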

2 Answers


Solution 1: Simple Solution for small dataset

For a small dataset, you can cross join df1 and df2 with .merge(how='cross'), then use .query() to keep only the rows where Price falls within the price range and year falls within the year range, as follows:

(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

If your Pandas version is older than 1.2.0 (released in December 2020) and does not support merge with how='cross', you can use:

(df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

Result:

   Price  year  score
0     10  2001     20
4     70  2002     50
8     50  2010     30
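
Note that the index values (0, 4, 8) are carried over from the cross join. If a clean 0-based index is preferred, an optional .reset_index(drop=True) can be chained at the end (not part of the original snippet):

(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
    .reset_index(drop=True)
)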

Solution 2: Numpy Solution for large dataset

For a large dataset where performance is a concern, you can use numpy broadcasting (instead of a cross join followed by filtering) to speed up the execution time.

We look for rows where Price in df2 falls within the price range in df1 and year in df2 falls within the year range in df1:

import numpy as np

# Underlying numpy arrays of the lookup values in df2
d2_P = df2.Price.values
d2_Y = df2.year.values

# Underlying numpy arrays of the range bounds in df1
d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values

# Broadcast df2's values against df1's ranges:
# i indexes the matching rows of df2, j the matching rows of df1
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

# Stitch the matching rows back together and keep only the columns of interest
pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]

Result:

   Price  year  score
0     10  2001     20
1     70  2002     50
2     50  2010     30

Performance Comparison

Part 1: Compare for original datasets of 3 rows each:

Solution 1:

%%timeit
(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

5.91 ms ± 87.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Solution 2:

%%timeit
d2_P = df2.Price.values
d2_Y = df2.year.values

d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values

i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]

703 µs ± 9.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Benchmark summary: 5.91 ms vs 703 µs, i.e. 8.4x faster

Part 2: Compare for datasets with 3,000 and 30,000 rows:

Data Setup:

df1a = pd.concat([df1] * 1000, ignore_index=True)
df2a = pd.concat([df2] * 10000, ignore_index=True)

Solution 1:

%%timeit
(df1a.merge(df2a, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

27.5 s ± 3.24 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 2:

%%timeit
d2_P = df2a.Price.values
d2_Y = df2a.year.values

d1_PS = df1a.price_start.values
d1_PE = df1a.price_end.values
d1_YS = df1a.year_start.values
d1_YE = df1a.year_end.values

i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

pd.DataFrame(
    np.column_stack([df1a.values[j], df2a.values[i]]),
    columns=df1a.columns.append(df2a.columns)
)[['Price', 'year', 'score']]

3.83 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Benchmark summary: 27.5 s vs 3.83 s, i.e. 7.2x faster

  • thanks @SeaBean for the answer. this works fine but need optimized one because in my case one of the dataframe is very huge and cross join consumes more time. – Devaraj Nadiger Sep 07 '21 at 12:16
  • @DevarajNadiger I've done some research and got a solution with considerable performance boost. Wrapping it up and doing some final testing. Will let you know when ready. Stay tuned! – SeaBean Sep 07 '21 at 22:48

One option is conditional_join from pyjanitor, which is designed for range (non-equi) joins and is more efficient than a naive cross join:

# pip install pyjanitor
# you can also install the dev version for the latest
# including the ability to use numba for faster performance
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

import janitor
import pandas as pd

(df1
.conditional_join(
    df2, 
    ('price_start', 'Price', '<='), 
    ('price_end', 'Price', '>='), 
    ('year_start', 'year', '<='), 
    ('year_end', 'year', '>='))
.loc(axis=1)['Price','year','score']
)
   Price  year  score
0     10  2001     20
1     70  2002     50
2     50  2010     30

With the dev version, you can select the columns as well:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

import janitor
import pandas as pd

(df1
.conditional_join(
    df2, 
    ('price_start', 'Price', '<='), 
    ('price_end', 'Price', '>='), 
    ('year_start', 'year', '<='), 
    ('year_end', 'year', '>='),
    use_numba = False,
    right_columns = ['Price', 'year'],
    df_columns = 'score')
)
   score  Price  year
0     20     10  2001
1     50     70  2002
2     30     50  2010

With the dev version, if you have numba installed, you can turn on use_numba for more performance.
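
A minimal sketch of that variant, assuming the dev version and numba are installed; it is the same call as above with only the use_numba flag changed:

# pip install numba
# same join as above; use_numba=True hands the range comparisons to numba
(df1
.conditional_join(
    df2,
    ('price_start', 'Price', '<='),
    ('price_end', 'Price', '>='),
    ('year_start', 'year', '<='),
    ('year_end', 'year', '>='),
    use_numba = True,
    right_columns = ['Price', 'year'],
    df_columns = 'score')
)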
