
With the following command:

pandas.merge(df_1, df_2, left_on=['date'], right_on=['from_date'])

I combine two rows from the two tables whenever the value in the date column of the first table is equal to the value in the from_date column of the second table.

Now I would like to make it slightly more complex. I need to combine a row from the first table with a row from the second table if the value in the date column of the first table is equal to or larger than the value in the from_date column of the second table and smaller than the value in the upto_date column of the second table.

In SQL one would use something like that:

select
    *
from
    table_1
join
    table_2
on
    table_1.date >= table_2.from_date
    and
    table_1.date <  table_2.upto_date

Is it possible to do this in pandas?

Roman
    Could you provide a short sample of your df1 and df2? – FooBar Sep 05 '14 at 15:23
  • Since the values you're joining on are no longer unique, you may not have the merge working as you expect. Perhaps look into .join, or .concat if you're looking to simply add the two tables together – DataSwede Sep 05 '14 at 22:08
  • Possible duplicate of http://stackoverflow.com/questions/23508351/how-to-do-a-conditional-join-in-python-pandas. There is a proposed issue regarding Conditional Join for Pandas DataFrame (https://github.com/pydata/pandas/issues/7480) – kushan_s Sep 10 '14 at 09:26
  • Wondering if a non-SQL solution would be easier (ie: parse + merge in python). – Alvin K. Sep 14 '14 at 06:23

3 Answers


pandasql is a pretty useful tool for querying pandas DataFrames using SQLite query syntax.

Here's an example similar to the one you describe.

Imports

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
import pandas as pd
from io import StringIO
from pandasql import sqldf

# helper func useful for saving keystrokes
# when running multiple queries
def dbGetQuery(q):
    return sqldf(q, globals())

Fake some data

sample_a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,75
2014-01-21 00:00:00,49
2014-01-23 00:00:00,28
2014-01-24 00:00:00,91
2014-01-25 00:00:00,88
2014-01-27 00:00:00,98
2014-01-28 00:00:00,39
2014-01-29 00:00:00,90
2014-01-30 00:00:00,63
2014-01-31 00:00:00,77
"""

sample_b = """from_date,to_date,measure
2014-01-02 00:00:00,2014-01-06 00:00:00,89
2014-01-03 00:00:00,2014-01-07 00:00:00,80
2014-01-04 00:00:00,2014-01-05 00:00:00,44
2014-01-05 00:00:00,2014-01-12 00:00:00,68
2014-01-06 00:00:00,2014-01-11 00:00:00,62
2014-01-07 00:00:00,2014-01-14 00:00:00,5
2014-01-08 00:00:00,2014-01-09 00:00:00,23
"""

Read datasets to create 2 DataFrames

df1 = pd.read_csv(StringIO(sample_a), parse_dates=['timepoint'])
df2 = pd.read_csv(StringIO(sample_b), parse_dates=['from_date', 'to_date'])

Write a SQL query

Note that this one uses the SQLite BETWEEN operator, which is inclusive on both ends, so rows where timepoint equals to_date are also kept. The question's condition uses a strict < on the upper bound; if you need that, swap BETWEEN out for ON df1.timepoint >= df2.from_date AND df1.timepoint < df2.to_date (a variant query is sketched after the output below).

query = """
SELECT
    DATE(df1.timepoint) AS timepoint
    , DATE(df2.from_date) AS start
    , DATE(df2.to_date) AS end
    , df1.measure AS measure_a
    , df2.measure AS measure_b
FROM
    df1 
INNER JOIN df2
    ON df1.timepoint BETWEEN 
        df2.from_date AND df2.to_date
ORDER BY
    df1.timepoint;
"""

Run the query using the helper func

df3 = dbGetQuery(query)

df3
     timepoint       start         end  measure_a  measure_b
0   2014-01-03  2014-01-02  2014-01-06          5         89
1   2014-01-03  2014-01-03  2014-01-07          5         80
2   2014-01-04  2014-01-02  2014-01-06         73         89
3   2014-01-04  2014-01-03  2014-01-07         73         80
4   2014-01-04  2014-01-04  2014-01-05         73         44
5   2014-01-05  2014-01-02  2014-01-06         40         89
6   2014-01-05  2014-01-03  2014-01-07         40         80
7   2014-01-05  2014-01-04  2014-01-05         40         44
8   2014-01-05  2014-01-05  2014-01-12         40         68
9   2014-01-06  2014-01-02  2014-01-06         45         89
10  2014-01-06  2014-01-03  2014-01-07         45         80
11  2014-01-06  2014-01-05  2014-01-12         45         68
12  2014-01-06  2014-01-06  2014-01-11         45         62
13  2014-01-08  2014-01-05  2014-01-12          2         68
14  2014-01-08  2014-01-06  2014-01-11          2         62
15  2014-01-08  2014-01-07  2014-01-14          2          5
16  2014-01-08  2014-01-08  2014-01-09          2         23
17  2014-01-09  2014-01-05  2014-01-12         96         68
18  2014-01-09  2014-01-06  2014-01-11         96         62
19  2014-01-09  2014-01-07  2014-01-14         96          5
20  2014-01-09  2014-01-08  2014-01-09         96         23
21  2014-01-10  2014-01-05  2014-01-12         82         68
22  2014-01-10  2014-01-06  2014-01-11         82         62
23  2014-01-10  2014-01-07  2014-01-14         82          5
24  2014-01-11  2014-01-05  2014-01-12         61         68
25  2014-01-11  2014-01-06  2014-01-11         61         62
26  2014-01-11  2014-01-07  2014-01-14         61          5
27  2014-01-12  2014-01-05  2014-01-12         68         68
28  2014-01-12  2014-01-07  2014-01-14         68          5
29  2014-01-13  2014-01-07  2014-01-14          8          5
30  2014-01-14  2014-01-07  2014-01-14         94          5
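As noted above, BETWEEN keeps rows where timepoint equals to_date. Here is a minimal sketch of the half-open variant that matches the question's strict upper bound, reusing the same df1, df2 and dbGetQuery helper (the aliases window_start and window_end are just illustrative names):

# Same join, but with an explicit half-open interval:
# inclusive lower bound, strict upper bound, as in the question.
query_half_open = """
SELECT
    DATE(df1.timepoint) AS timepoint
    , DATE(df2.from_date) AS window_start
    , DATE(df2.to_date) AS window_end
    , df1.measure AS measure_a
    , df2.measure AS measure_b
FROM
    df1
INNER JOIN df2
    ON df1.timepoint >= df2.from_date
    AND df1.timepoint < df2.to_date
ORDER BY
    df1.timepoint;
"""

df4 = dbGetQuery(query_half_open)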
hernamesbarbara

conditional_join from pyjanitor covers inequality (non-equi) joins efficiently:

Using @hernamesbarbara's fake data:

# pip install pyjanitor
import pandas as pd
import janitor

(df1.conditional_join(
         df2, 
         ('timepoint', 'from_date', '>='), 
         ('timepoint', 'to_date', '<='))
)
 
         left              right                   
    timepoint measure  from_date    to_date measure
0  2014-01-03       5 2014-01-02 2014-01-06      89
1  2014-01-03       5 2014-01-03 2014-01-07      80
2  2014-01-04      73 2014-01-02 2014-01-06      89
3  2014-01-04      73 2014-01-03 2014-01-07      80
4  2014-01-04      73 2014-01-04 2014-01-05      44
5  2014-01-05      40 2014-01-02 2014-01-06      89
6  2014-01-05      40 2014-01-03 2014-01-07      80
7  2014-01-05      40 2014-01-04 2014-01-05      44
8  2014-01-05      40 2014-01-05 2014-01-12      68
9  2014-01-06      45 2014-01-02 2014-01-06      89
10 2014-01-06      45 2014-01-03 2014-01-07      80
11 2014-01-06      45 2014-01-05 2014-01-12      68
12 2014-01-06      45 2014-01-06 2014-01-11      62
13 2014-01-08       2 2014-01-05 2014-01-12      68
14 2014-01-08       2 2014-01-06 2014-01-11      62
15 2014-01-08       2 2014-01-07 2014-01-14       5
16 2014-01-08       2 2014-01-08 2014-01-09      23
17 2014-01-09      96 2014-01-05 2014-01-12      68
18 2014-01-09      96 2014-01-06 2014-01-11      62
19 2014-01-09      96 2014-01-07 2014-01-14       5
20 2014-01-09      96 2014-01-08 2014-01-09      23
21 2014-01-10      82 2014-01-05 2014-01-12      68
22 2014-01-10      82 2014-01-06 2014-01-11      62
23 2014-01-10      82 2014-01-07 2014-01-14       5
24 2014-01-11      61 2014-01-05 2014-01-12      68
25 2014-01-11      61 2014-01-06 2014-01-11      62
26 2014-01-11      61 2014-01-07 2014-01-14       5
27 2014-01-12      68 2014-01-05 2014-01-12      68
28 2014-01-12      68 2014-01-07 2014-01-14       5
29 2014-01-13       8 2014-01-07 2014-01-14       5
30 2014-01-14      94 2014-01-07 2014-01-14       5
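Note that this uses '<=' on the upper bound, so rows where timepoint equals to_date are kept. conditional_join also accepts a strict '<', so a sketch matching the question's half-open interval would be:

# Strict upper bound, matching the question's
# date >= from_date AND date < upto_date condition.
(df1.conditional_join(
    df2,
    ('timepoint', 'from_date', '>='),
    ('timepoint', 'to_date', '<'))
)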
sammywemmy

I found a solution, I think. However, I am not sure it is elegant or optimal:

# add a constant dummy key to both frames so that merging on it
# produces the full cross product of rows
df_1['A'] = 'A'
df_2['A'] = 'A'
df = pandas.merge(df_1, df_2, on=['A'])
# keep only the row pairs that satisfy the interval condition
df = df[(df['date'] >= df['from']) & (df['date'] < df['upto'])]
# drop the dummy key again
del df['A']

Posted on behalf of the question asker
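Since pandas 1.2, the dummy-key trick is no longer necessary: merge supports how='cross' directly. A minimal sketch, assuming the same DataFrames and column names as above:

import pandas

# cross join of all row pairs (requires pandas >= 1.2), then filter
# down to the pairs that satisfy the interval condition
df = pandas.merge(df_1, df_2, how='cross')
df = df[(df['date'] >= df['from']) & (df['date'] < df['upto'])]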

Dharman