
With the following command:

pandas.merge(df_1, df_2, left_on=['date'], right_on=['from_date'])

I combine two rows from the two tables whenever the value in the date column of the first table is equal to the value in the from_date column of the second table.

Now I would like to make it slightly more complex. I need to combine a row from the first table with a row from the second table if the value in the date column of the first table is equal to or larger than the value in the from_date column of the second table and smaller than the value in the upto_date column of the second table.

In SQL one would use something like that:

select
    *
from
    table_1
join
    table_2
on
    table_1.date >= table_2.from_date
    and
    table_1.date <  table_2.upto_date

Is it possible to do this in pandas?

Roman
    Could you provide a short sample of your df1 and df2? – FooBar Sep 05 '14 at 15:23
  • Since the values you're joining on are no longer unique, you may not have the merge working as you expect. Perhaps look into .join, or .concat if you're looking to simply add the two tables together – DataSwede Sep 05 '14 at 22:08
  • Possible duplicate of http://stackoverflow.com/questions/23508351/how-to-do-a-conditional-join-in-python-pandas. There is a proposed issue regarding Conditional Join for Pandas DataFrame (https://github.com/pydata/pandas/issues/7480) – kushan_s Sep 10 '14 at 09:26
  • Wondering if a non-SQL solution would be easier (ie: parse + merge in python). – Alvin K. Sep 14 '14 at 06:23

3 Answers


pandasql is a pretty useful tool for querying pandas DataFrames using SQLite query syntax.

Here's an example similar to the one you describe.

Imports

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
import pandas as pd
from io import StringIO
from pandasql import sqldf

# helper func useful for saving keystrokes
# when running multiple queries
def dbGetQuery(q):
    return sqldf(q, globals())

Fake some data

sample_a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,75
2014-01-21 00:00:00,49
2014-01-23 00:00:00,28
2014-01-24 00:00:00,91
2014-01-25 00:00:00,88
2014-01-27 00:00:00,98
2014-01-28 00:00:00,39
2014-01-29 00:00:00,90
2014-01-30 00:00:00,63
2014-01-31 00:00:00,77
"""

sample_b = """from_date,to_date,measure
2014-01-02 00:00:00,2014-01-06 00:00:00,89
2014-01-03 00:00:00,2014-01-07 00:00:00,80
2014-01-04 00:00:00,2014-01-05 00:00:00,44
2014-01-05 00:00:00,2014-01-12 00:00:00,68
2014-01-06 00:00:00,2014-01-11 00:00:00,62
2014-01-07 00:00:00,2014-01-14 00:00:00,5
2014-01-08 00:00:00,2014-01-09 00:00:00,23
"""

Read datasets to create 2 DataFrames

df1 = pd.read_csv(StringIO(sample_a), parse_dates=['timepoint'])
df2 = pd.read_csv(StringIO(sample_b), parse_dates=['from_date', 'to_date'])

Write a SQL query

Note that this one uses the SQLite BETWEEN operator, which is inclusive on both ends, so rows where timepoint equals to_date are also kept. The question's condition uses a strict < on the upper bound; if you need that, swap BETWEEN out for ON df1.timepoint >= df2.from_date AND df1.timepoint < df2.to_date (a variant query is sketched after the output below).

query = """
SELECT
    DATE(df1.timepoint) AS timepoint
    , DATE(df2.from_date) AS start
    , DATE(df2.to_date) AS end
    , df1.measure AS measure_a
    , df2.measure AS measure_b
FROM
    df1 
INNER JOIN df2
    ON df1.timepoint BETWEEN 
        df2.from_date AND df2.to_date
ORDER BY
    df1.timepoint;
"""

Run the query using the helper func

df3 = dbGetQuery(query)

df3
     timepoint       start         end  measure_a  measure_b
0   2014-01-03  2014-01-02  2014-01-06          5         89
1   2014-01-03  2014-01-03  2014-01-07          5         80
2   2014-01-04  2014-01-02  2014-01-06         73         89
3   2014-01-04  2014-01-03  2014-01-07         73         80
4   2014-01-04  2014-01-04  2014-01-05         73         44
5   2014-01-05  2014-01-02  2014-01-06         40         89
6   2014-01-05  2014-01-03  2014-01-07         40         80
7   2014-01-05  2014-01-04  2014-01-05         40         44
8   2014-01-05  2014-01-05  2014-01-12         40         68
9   2014-01-06  2014-01-02  2014-01-06         45         89
10  2014-01-06  2014-01-03  2014-01-07         45         80
11  2014-01-06  2014-01-05  2014-01-12         45         68
12  2014-01-06  2014-01-06  2014-01-11         45         62
13  2014-01-08  2014-01-05  2014-01-12          2         68
14  2014-01-08  2014-01-06  2014-01-11          2         62
15  2014-01-08  2014-01-07  2014-01-14          2          5
16  2014-01-08  2014-01-08  2014-01-09          2         23
17  2014-01-09  2014-01-05  2014-01-12         96         68
18  2014-01-09  2014-01-06  2014-01-11         96         62
19  2014-01-09  2014-01-07  2014-01-14         96          5
20  2014-01-09  2014-01-08  2014-01-09         96         23
21  2014-01-10  2014-01-05  2014-01-12         82         68
22  2014-01-10  2014-01-06  2014-01-11         82         62
23  2014-01-10  2014-01-07  2014-01-14         82          5
24  2014-01-11  2014-01-05  2014-01-12         61         68
25  2014-01-11  2014-01-06  2014-01-11         61         62
26  2014-01-11  2014-01-07  2014-01-14         61          5
27  2014-01-12  2014-01-05  2014-01-12         68         68
28  2014-01-12  2014-01-07  2014-01-14         68          5
29  2014-01-13  2014-01-07  2014-01-14          8          5
30  2014-01-14  2014-01-07  2014-01-14         94          5
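As noted above, BETWEEN keeps rows where timepoint equals to_date. Here is a minimal sketch of the half-open variant that matches the question's strict upper bound, reusing the same df1, df2 and dbGetQuery helper (the aliases window_start and window_end are just illustrative names):

# Same join, but with an explicit half-open interval:
# inclusive lower bound, strict upper bound, as in the question.
query_half_open = """
SELECT
    DATE(df1.timepoint) AS timepoint
    , DATE(df2.from_date) AS window_start
    , DATE(df2.to_date) AS window_end
    , df1.measure AS measure_a
    , df2.measure AS measure_b
FROM
    df1
INNER JOIN df2
    ON df1.timepoint >= df2.from_date
    AND df1.timepoint < df2.to_date
ORDER BY
    df1.timepoint;
"""

df4 = dbGetQuery(query_half_open)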
hernamesbarbara

conditional_join from pyjanitor covers inequality (non-equi) joins efficiently:

Using @hernamesbarbara's fake data:

# pip install pyjanitor
import pandas as pd
import janitor

(df1.conditional_join(
         df2, 
         ('timepoint', 'from_date', '>='), 
         ('timepoint', 'to_date', '<='))
)
 
         left              right                   
    timepoint measure  from_date    to_date measure
0  2014-01-03       5 2014-01-02 2014-01-06      89
1  2014-01-03       5 2014-01-03 2014-01-07      80
2  2014-01-04      73 2014-01-02 2014-01-06      89
3  2014-01-04      73 2014-01-03 2014-01-07      80
4  2014-01-04      73 2014-01-04 2014-01-05      44
5  2014-01-05      40 2014-01-02 2014-01-06      89
6  2014-01-05      40 2014-01-03 2014-01-07      80
7  2014-01-05      40 2014-01-04 2014-01-05      44
8  2014-01-05      40 2014-01-05 2014-01-12      68
9  2014-01-06      45 2014-01-02 2014-01-06      89
10 2014-01-06      45 2014-01-03 2014-01-07      80
11 2014-01-06      45 2014-01-05 2014-01-12      68
12 2014-01-06      45 2014-01-06 2014-01-11      62
13 2014-01-08       2 2014-01-05 2014-01-12      68
14 2014-01-08       2 2014-01-06 2014-01-11      62
15 2014-01-08       2 2014-01-07 2014-01-14       5
16 2014-01-08       2 2014-01-08 2014-01-09      23
17 2014-01-09      96 2014-01-05 2014-01-12      68
18 2014-01-09      96 2014-01-06 2014-01-11      62
19 2014-01-09      96 2014-01-07 2014-01-14       5
20 2014-01-09      96 2014-01-08 2014-01-09      23
21 2014-01-10      82 2014-01-05 2014-01-12      68
22 2014-01-10      82 2014-01-06 2014-01-11      62
23 2014-01-10      82 2014-01-07 2014-01-14       5
24 2014-01-11      61 2014-01-05 2014-01-12      68
25 2014-01-11      61 2014-01-06 2014-01-11      62
26 2014-01-11      61 2014-01-07 2014-01-14       5
27 2014-01-12      68 2014-01-05 2014-01-12      68
28 2014-01-12      68 2014-01-07 2014-01-14       5
29 2014-01-13       8 2014-01-07 2014-01-14       5
30 2014-01-14      94 2014-01-07 2014-01-14       5
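Note that this uses '<=' on the upper bound, so rows where timepoint equals to_date are kept. conditional_join also accepts a strict '<', so a sketch matching the question's half-open interval would be:

# Strict upper bound, matching the question's
# date >= from_date AND date < upto_date condition.
(df1.conditional_join(
    df2,
    ('timepoint', 'from_date', '>='),
    ('timepoint', 'to_date', '<'))
)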
sammywemmy

I found a solution, I think. However, I am not sure it is elegant or optimal:

# add a constant dummy key to both frames so that merging on it
# produces the full cross product of rows
df_1['A'] = 'A'
df_2['A'] = 'A'
df = pandas.merge(df_1, df_2, on=['A'])
# keep only the row pairs that satisfy the interval condition
df = df[(df['date'] >= df['from']) & (df['date'] < df['upto'])]
# drop the dummy key again
del df['A']

Posted on behalf of the question asker
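Since pandas 1.2, the dummy-key trick is no longer necessary: merge supports how='cross' directly. A minimal sketch, assuming the same DataFrames and column names as above:

import pandas

# cross join of all row pairs (requires pandas >= 1.2), then filter
# down to the pairs that satisfy the interval condition
df = pandas.merge(df_1, df_2, how='cross')
df = df[(df['date'] >= df['from']) & (df['date'] < df['upto'])]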

Dharman