3

I have a Python pandas DataFrame with two relevant columns, "date" and "value". Let's assume it looks like this and is ordered by date:

import pandas as pd

data = pd.DataFrame({"date": ["2021-01-01", "2021-01-31", "2021-02-01", "2021-02-28", "2021-03-01", "2021-03-31", "2021-04-01", "2021-04-02"],
                     "value": [1, 2, 3, 4, 5, 6, 5, 8]})
data["date"] = pd.to_datetime(data["date"])

Now I want to join the DataFrame to itself in such a way that, for each last available day in a month, I get the next available day where the value is higher. In our example this should look like this:

date, value, date2, value2:
2021-01-31, 2, 2021-02-01, 3
2021-02-28, 4, 2021-03-01, 5
2021-03-31, 6, 2021-04-02, 8
2021-04-02, 8, NaN, NaN

My current partial solution to this problem looks like this:

last_days = data.groupby([data.date.dt.year, data.date.dt.month]).last()
res = [data.loc[(data.date>date) & (data.value > value)][:1] for date, value in zip(last_days.date, last_days.value)]
print(res)
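For reference, the same loop assembled into a single frame (padding months without a later higher value with NaT/NaN) produces exactly the table above:

```python
import pandas as pd

data = pd.DataFrame({"date": ["2021-01-01", "2021-01-31", "2021-02-01", "2021-02-28",
                              "2021-03-01", "2021-03-31", "2021-04-01", "2021-04-02"],
                     "value": [1, 2, 3, 4, 5, 6, 5, 8]})
data["date"] = pd.to_datetime(data["date"])

# last available row per (year, month) group
last_days = data.groupby([data.date.dt.year, data.date.dt.month]).last()

# first later row with a strictly higher value, per last day
matches = [data.loc[(data.date > date) & (data.value > value)][:1]
           for date, value in zip(last_days.date, last_days.value)]

# pad months without a match so every last day keeps a row
nexts = pd.concat(
    [m.reset_index(drop=True) if len(m)
     else pd.DataFrame({"date": [pd.NaT], "value": [float("nan")]})
     for m in matches],
    ignore_index=True)

out = pd.concat([last_days.reset_index(drop=True), nexts.add_suffix("2")],
                axis="columns")
```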

But because of the answer "Don't iterate over rows in a dataframe", this doesn't feel like the pandas way to me.

So the question is, how to solve it the pandas way?

jackattack
  • 75
  • 8

3 Answers

2

If you don’t have too many rows, you could generate all pairs of items and filter from there.

Let’s start by getting the last days of each month:

>>> last = data.loc[data['date'].dt.daysinmonth == data['date'].dt.day]
>>> last
        date  value
1 2021-01-31      2
3 2021-02-28      4
5 2021-03-31      6
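(Note this mask assumes each month's calendar last day is actually present in the data. If days can be missing, e.g. a business-day series, a groupby-based pick of the last available row per month works instead — a sketch:)

```python
import pandas as pd

# Jan 31, 2021 falls on a Sunday, so a business-day series skips it
data = pd.DataFrame({"date": ["2021-01-28", "2021-01-29", "2021-02-01"],
                     "value": [1, 2, 3]})
data["date"] = pd.to_datetime(data["date"])

# last available row per (year, month), keeping the original index
last = data.groupby([data["date"].dt.year, data["date"].dt.month]).tail(1)
```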

Now use a cross join to map each last day to every possible day, then filter on the criteria: a later date and a larger value:

>>> pairs = pd.merge(last, data, how='cross', suffixes=('', '2'))
>>> pairs = pairs.loc[pairs['date2'].gt(pairs['date']) & pairs['value2'].gt(pairs['value'])]
>>> pairs
         date  value      date2  value2
2  2021-01-31      2 2021-02-01       3
3  2021-01-31      2 2021-02-28       4
4  2021-01-31      2 2021-03-01       5
5  2021-01-31      2 2021-03-31       6
6  2021-01-31      2 2021-04-01       5
7  2021-01-31      2 2021-04-02       8
12 2021-02-28      4 2021-03-01       5
13 2021-02-28      4 2021-03-31       6
14 2021-02-28      4 2021-04-01       5
15 2021-02-28      4 2021-04-02       8
23 2021-03-31      6 2021-04-02       8

Finally, use GroupBy.idxmin() on date2 to keep the earliest date2 per last day:

>>> pairs.loc[pairs.groupby(['date', 'value'])['date2'].idxmin().values]
         date  value      date2  value2
2  2021-01-31      2 2021-02-01       3
12 2021-02-28      4 2021-03-01       5
23 2021-03-31      6 2021-04-02       8
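As the comments note, the expected output also keeps last days that have no later higher value. A sketch that restores them with a left merge (and that takes the last available row per month rather than only calendar month-ends):

```python
import pandas as pd

data = pd.DataFrame({"date": ["2021-01-01", "2021-01-31", "2021-02-01", "2021-02-28",
                              "2021-03-01", "2021-03-31", "2021-04-01", "2021-04-02"],
                     "value": [1, 2, 3, 4, 5, 6, 5, 8]})
data["date"] = pd.to_datetime(data["date"])

# last available row per month (also catches 2021-04-02)
last = data.groupby([data["date"].dt.year, data["date"].dt.month]).tail(1)

# cross join, filter, then keep the earliest later-and-higher row per last day
pairs = pd.merge(last, data, how="cross", suffixes=("", "2"))
pairs = pairs.loc[pairs["date2"].gt(pairs["date"]) & pairs["value2"].gt(pairs["value"])]
first = pairs.loc[pairs.groupby(["date", "value"])["date2"].idxmin().values]

# left-merge back so last days with no later higher value keep NaT/NaN
result = last.merge(first, on=["date", "value"], how="left")
```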

Otherwise you might want apply, which, to be honest, is pretty much the same as iterating over rows.
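A sketch of that apply-based alternative, equivalent to the question's loop:

```python
import pandas as pd

data = pd.DataFrame({"date": ["2021-01-01", "2021-01-31", "2021-02-01", "2021-02-28",
                              "2021-03-01", "2021-03-31", "2021-04-01", "2021-04-02"],
                     "value": [1, 2, 3, 4, 5, 6, 5, 8]})
data["date"] = pd.to_datetime(data["date"])
last = data.loc[data["date"].dt.daysinmonth == data["date"].dt.day]

def next_higher(row):
    # first later row with a strictly higher value, else NaT/NaN
    nxt = data.loc[(data["date"] > row["date"]) & (data["value"] > row["value"])]
    if nxt.empty:
        return pd.Series({"date2": pd.NaT, "value2": float("nan")})
    return pd.Series({"date2": nxt.iloc[0]["date"], "value2": nxt.iloc[0]["value"]})

out = pd.concat([last, last.apply(next_higher, axis="columns")], axis="columns")
```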

Cimbali
  • 11,012
  • 1
  • 39
  • 68
  • The filtered frame doesn't look like the expected output. – Henry Ecker Jun 12 '21 at 21:26
  • 1
    Ok, 2 improvements, but this looks like a good solution, thx! 1) edit "last" into the "merge": pd.merge(last, data, ...), otherwise it's wrong. 2) Your mask for last is not exactly what I want, but maybe I wasn't precise enough, sorry. I'm working with business days, so the last entry for a month in the data is not necessarily the last day of the month (e.g. when the last day of the month is a Sunday). But my mask is working. If you fix 1) I will accept your answer. – jackattack Jun 13 '21 at 10:20
  • Additional Question: This looks more like a pandas ( / sql / database) approach to me and I think it was what I was looking for, thx again. But is better? And why? (I'm still learning pandas) I guess it needs more memory (compared to the loop/list comprehension), is faster? is it more readable? ...? – jackattack Jun 13 '21 at 10:24
  • @jackattack fixed it ! Regarding if it’s better: it all depends on the real-world use case. You’re right that in general you don’t want to iterate on rows, but if you have a lot of items it might be faster than generating all pairs, and will definitely take less memory. If you just want to explore pandas then it’s interesting to look at both options and how they work. – Cimbali Jun 13 '21 at 19:06
2

First, create two masks: one for the last day of each month and another for the first day of the next month.

m1 = data['date'].diff(1).shift(-1) == pd.Timedelta(days=1)
m2 = m1.shift(1, fill_value=False)

Finally, concatenate the 2 results ignoring index:

>>> pd.concat([data.loc[m1].reset_index(drop=True),
               data.loc[m2].reset_index(drop=True)], axis="columns")

        date  value       date  value
0 2021-01-31      2 2021-02-01      3
1 2021-02-28      4 2021-03-01      5
2 2021-03-31      6 2021-04-01      5
3 2021-04-01      5 2021-04-02      8
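For completeness, here is the above as one runnable snippet (note, per the comments below, that it pairs consecutive calendar days without checking that the second value is higher):

```python
import pandas as pd

data = pd.DataFrame({"date": ["2021-01-01", "2021-01-31", "2021-02-01", "2021-02-28",
                              "2021-03-01", "2021-03-31", "2021-04-01", "2021-04-02"],
                     "value": [1, 2, 3, 4, 5, 6, 5, 8]})
data["date"] = pd.to_datetime(data["date"])

# rows whose next row is exactly one calendar day later ...
m1 = data["date"].diff(1).shift(-1) == pd.Timedelta(days=1)
# ... and the rows immediately following them
m2 = m1.shift(1, fill_value=False)

out = pd.concat([data.loc[m1].reset_index(drop=True),
                 data.loc[m2].reset_index(drop=True)], axis="columns")
```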
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • This is what I thought at first too. You missed, as I did, the "next available day where the value is higher" part. '2021-03-31' should not pair with '2021-04-01' as value 5 is not greater than 6. – Henry Ecker Jun 12 '21 at 23:12
  • Yes, as @HenryEcker already wrote, this is unfortunately not a solution. We have 2 problems here: 1. m1 does not give us the last available day for each month in the dataframe. If I interpret it correctly, it gives us entries where the difference to the next entry is 1 day. 2. the next entry is not necessarily greater in value. – jackattack Jun 13 '21 at 09:29
0

One option is the conditional_join from pyjanitor, which uses binary search under the hood and should be faster and more memory efficient than a cross merge as the data size grows. Also, have a look at the piso library to see if it can be helpful/more efficient:

Get the last dates via a groupby (the assumption here is that the data is already sorted; if not, sort it before grouping):

# pip install pyjanitor
import pandas as pd
import janitor
trim = (data
         .groupby([data.date.dt.year, data.date.dt.month], as_index = False)
         .nth(-1)
        )

trim
        date  value
1 2021-01-31      2
3 2021-02-28      4
5 2021-03-31      6
7 2021-04-02      8

Use conditional_join to get rows where the value from trim is less than the value in data, and the date from trim is earlier than the date in data:

trimmed = trim.conditional_join(data, 
                      # variable arguments
                      # tuple is of the form:
                      # col_from_left_df, col_from_right_df, comparator
                      ('value', 'value', '<'), 
                      ('date', 'date', '<'), 
                      how = 'left')

trimmed

         left            right
         date value       date value
0  2021-01-31     2 2021-02-01   3.0
1  2021-01-31     2 2021-02-28   4.0
2  2021-01-31     2 2021-03-01   5.0
3  2021-01-31     2 2021-04-01   5.0
4  2021-01-31     2 2021-03-31   6.0
5  2021-01-31     2 2021-04-02   8.0
6  2021-02-28     4 2021-03-01   5.0
7  2021-02-28     4 2021-04-01   5.0
8  2021-02-28     4 2021-03-31   6.0
9  2021-02-28     4 2021-04-02   8.0
10 2021-03-31     6 2021-04-02   8.0
11 2021-04-02     8        NaT   NaN

Since we are only interested in the first match, a groupby is required.

trimmed = (trimmed
           .groupby(('left', 'date'), dropna = False, as_index = False)
           .nth(0)
           )

trimmed

         left            right
         date value       date value
0  2021-01-31     2 2021-02-01   3.0
6  2021-02-28     4 2021-03-01   5.0
10 2021-03-31     6 2021-04-02   8.0
11 2021-04-02     8        NaT   NaN

You can flatten the columns:

trimmed.set_axis(['date', 'value', 'date2', 'value2'], axis = 'columns')

         date  value      date2   value2
0  2021-01-31      2 2021-02-01      3.0
6  2021-02-28      4 2021-03-01      5.0
10 2021-03-31      6 2021-04-02      8.0
11 2021-04-02      8        NaT      NaN
sammywemmy
  • 27,093
  • 4
  • 17
  • 31