
I have two dataframes as shown below:

     time browncarbon blackcarbon
 181.7335    0.105270         NaN
 181.3809    0.166545    0.001217
 181.6197    0.071581         NaN

 422 rows x 3 columns

   start       end    toc 
179.9989  180.0002  155.0
180.0002  180.0016  152.0
180.0016  180.0030  151.0

1364 rows x 3 columns

The first dataframe has a time column with instants every four minutes. The second dataframe has two time columns (start and end) spaced every two minutes. The time columns do not start and end at the same times, but they cover data collected over the same day. How could I make another dataframe containing:

time browncarbon blackcarbon toc

422 rows x 4 columns

There is a related answer on Stack Overflow; however, it is applicable only when the time columns are datetime or timestamp objects. The link is: How to join two dataframes for which column values are within a certain range?

Addendum 1: The multiple start/end rows that fall within a single time row should still correspond to one toc value, as they do now; however, that value should be the average of the multiple toc rows, which is not the case presently.

Addendum 2: Merging two pandas dataframes with complex conditions

  • How would you fill the `toc` column? Where time falls into the range? – Quang Hoang Jul 16 '19 at 21:19
  • And also, your sample data does not really reflect your description. For example, I don't see how the first dataframe is spaced every 4 minutes. – Quang Hoang Jul 16 '19 at 21:21
  • I agree, this would not give any result, since in your example none of the times fall between the start & end of the second dataframe. Change your example dataframe so that some of the `time` values from the first dataframe fall in the ranges of the second dataframe. – Erfan Jul 16 '19 at 21:22
  • @QuangHoang Yes, the toc column should be filled where time falls in the range. – Sujai Banerji Jul 16 '19 at 21:24
  • @QuangHoang the time is in a peculiar format. The integer part represents the day of the year. The decimal part represents the fraction of the day that has elapsed (seconds since midnight divided by 86400). For example, 12:00 PM would be 12*3600/86400. – Sujai Banerji Jul 16 '19 at 21:27
  • I am saying that the time in the first dataframe is not increasing. – Quang Hoang Jul 16 '19 at 21:28
  • @Erfan There are times that fall in between. I have only shown the first three rows of both the dataframes in which there were none. – Sujai Banerji Jul 16 '19 at 21:28
  • @QuangHoang The time in the first dataframe is in random order, however, in the second dataframe, it is in increasing order. – Sujai Banerji Jul 16 '19 at 21:29
  • You can use `merge_asof`, but you need to sort the time in the first dataframe. You can certainly sort it back after the merge. – Quang Hoang Jul 16 '19 at 21:31
  • @QuangHoang Would you want to answer the question with a code snippet? – Sujai Banerji Jul 16 '19 at 21:35

4 Answers


We create an artificial key column to do an outer merge and get the cartesian product back (all combinations of rows). Then we keep only the rows where time falls within the range, using .query.

Note: I edited the value of one row so we get a match (see row 0 in the example dataframes at the bottom).

(df1.assign(key=1)                                     # constant join key on both frames
    .merge(df2.assign(key=1), on='key', how='outer')   # cartesian product of the rows
    .query('(time >= start) & (time <= end)')          # keep rows where time lies in [start, end]
    .drop(['key', 'start', 'end'], axis=1))            # drop the helper columns

output

       time  browncarbon  blackcarbon    toc
1  180.0008      0.10527          NaN  152.0

Example dataframes used:

df1:

       time  browncarbon  blackcarbon
0  180.0008     0.105270          NaN
1  181.3809     0.166545     0.001217
2  181.6197     0.071581          NaN

df2:

      start       end    toc
0  179.9989  180.0002  155.0
1  180.0002  180.0016  152.0
2  180.0016  180.0030  151.0
Erfan

Since the start and end intervals are mutually exclusive, we may be able to add a new column to df2 that holds the integer values in the range floor(start) to floor(end). Then add a floor(time) column to df1 and take a left outer join of df1 and df2 on those columns. That should work, except that you may have to remove NaN values and extra columns afterwards. If you send me the csv files, I may be able to send you the script. I hope I answered your question.
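
A minimal sketch of the idea described above, using whole-day integer keys; the helper names below are illustrative and not from the post:

import numpy as np
import pandas as pd

# expand each df2 row to one row per integer key between floor(start) and floor(end)
df2_keyed = df2.copy()
df2_keyed['key'] = [np.arange(int(np.floor(s)), int(np.floor(e)) + 1)
                    for s, e in zip(df2_keyed['start'], df2_keyed['end'])]
df2_keyed = df2_keyed.explode('key')
df2_keyed['key'] = df2_keyed['key'].astype(int)

# matching floor(time) key on df1, then a left outer join as described
df1_keyed = df1.copy()
df1_keyed['key'] = np.floor(df1_keyed['time']).astype(int)
merged = df1_keyed.merge(df2_keyed, on='key', how='left')

Note that with whole-day keys every interval from the same day matches a given time, so a range filter such as the .query step in Erfan's answer would still be needed to pick the right interval.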

Parijat Bhatt

Your second dataframe is too short to show a meaningful merge, so I modified it a little:

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'start': [179.9989, 180.0002, 180.0016, 181.3, 181.5, 181.7],
                    'end':   [180.0002, 180.0016, 180.003, 181.5, 185.7, 181.8],
                    'toc':   [155.0, 152.0, 151.0, 150.0, 149.0, 148.0]})

# remember the original row order so it can be restored after the merge
df1['Rank'] = np.arange(len(df1))

# merge_asof requires both keys to be sorted; it matches each time with the
# most recent start that is <= time
new_df = pd.merge_asof(df1.sort_values('time'), df2,
                       left_on='time',
                       right_on='start')

gives you:

       time  browncarbon  blackcarbon  Rank  start    end    toc
0  181.3809     0.166545     0.001217     1  181.3  181.5  150.0
1  181.6197     0.071581          NaN     2  181.5  185.7  149.0
2  181.7335     0.105270          NaN     0  181.7  181.8  148.0

from which you can drop the extra columns and sort_values on Rank. For example:

new_df.sort_values('Rank').drop(['Rank','start','end'], axis=1)

gives:

       time  browncarbon  blackcarbon    toc
2  181.7335     0.105270          NaN  148.0
0  181.3809     0.166545     0.001217  150.0
1  181.6197     0.071581          NaN  149.0
Quang Hoang
  • Actually, your answer is the correct answer. However, I would be even more grateful if you could also include the condition I stated in the addendum. – Sujai Banerji Jul 17 '19 at 06:33
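
A minimal sketch of how the averaging asked for in Addendum 1 could be handled, reusing the cartesian-product idea from Erfan's answer. It assumes each time row should absorb every interval that overlaps the four-minute window starting at time; both the window length and the overlap rule are assumptions, not something stated in the thread:

import pandas as pd

# four minutes expressed as a fraction of a day, matching the time encoding
WINDOW = 4 / (24 * 60)

# cartesian product of the two frames, as in Erfan's answer
cart = df1.assign(key=1).merge(df2.assign(key=1), on='key', how='outer')

# keep every interval that overlaps [time, time + WINDOW) ...
overlaps = cart.query('(start < time + @WINDOW) & (end > time)')

# ... and average toc over the intervals absorbed by each time row
toc_mean = overlaps.groupby('time', as_index=False)['toc'].mean()

# left join back onto df1 so the result keeps all 422 time rows
result = df1.merge(toc_mean, on='time', how='left')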

Perhaps you could just convert your columns to Timestamps and then use the answer from the other question you linked:

from pandas import Timestamp
from dateutil.relativedelta import relativedelta as rd

def to_timestamp(x):
    # integer part = day of year, fractional part = fraction of that day;
    # relativedelta accepts fractional days and normalizes them
    return Timestamp(2000, 1, 1) + rd(days=x)

df['start_time'] = df.start.apply(to_timestamp)
df['end_time'] = df.end.apply(to_timestamp)
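
A short usage sketch of the same idea, assuming df above is the question's df2 and converting df1.time the same way; the IntervalIndex lookup is one possible way to finish, in the spirit of the linked answer (time_stamp is an illustrative column name):

import numpy as np
import pandas as pd

df2['start_time'] = df2.start.apply(to_timestamp)
df2['end_time'] = df2.end.apply(to_timestamp)
df1['time_stamp'] = df1.time.apply(to_timestamp)

# build half-open intervals [start_time, end_time) and find, for each time,
# the interval it falls into (-1 where there is no match)
intervals = pd.IntervalIndex.from_arrays(df2['start_time'], df2['end_time'],
                                         closed='left')
idx = intervals.get_indexer(df1['time_stamp'])

df1['toc'] = np.where(idx != -1, df2['toc'].to_numpy()[idx], np.nan)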
VersBersch