9

Currently I have two data frames representing excel spreadsheets. I wish to join the data where the dates are equal. This is a one to many join as one spread sheet has a date then I need to add data which has multiple rows with the same date

an example:

            A                  B
     date     data       date                 data
0    2015-0-1 ...     0  2015-0-1 to 2015-0-2 ...
1    2015-0-2 ...     1  2015-0-1 to 2015-0-2 ...

In this case both rows from A would recieve rows 0 and 1 from B because they are in that range.

I tried using

df3 = pandas.merge(df2, df1, how='right', validate='1:m', left_on='Travel Date/Range', right_on='End')

to accomplish this but received this error.

Traceback (most recent call last):
  File "<pyshell#61>", line 1, in <module>
    df3 = pandas.merge(df2, df1, how='right', validate='1:m', left_on='Travel Date/Range', right_on='End')
  File "C:\Users\M199449\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\merge.py", line 61, in merge
    validate=validate)
  File "C:\Users\M199449\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\merge.py", line 555, in __init__
    self._maybe_coerce_merge_keys()
  File "C:\Users\M199449\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\merge.py", line 990, in _maybe_coerce_merge_keys
    raise ValueError(msg)
ValueError: You are trying to merge on object and datetime64[ns] columns. If you wish to proceed you should use pd.concat

I can add more information as needed of course

wpercy
  • 9,636
  • 4
  • 33
  • 45
Ryan Schaefer
  • 3,047
  • 1
  • 26
  • 46
  • Provide importable data to play with. `df.to_json()` or something. – Maksim Terpilowski Aug 08 '18 at 20:47
  • 1
    The error says it all. you have mismatch type of object and datetime. – mad_ Aug 08 '18 at 20:48
  • @mad you misunderstand, I am wondering if there is a way to join if the date is in the date range, I recognize that the error says that it is not working because the are different types but I want to know an alternative way to accomplish this task. – Ryan Schaefer Aug 08 '18 at 20:50
  • Are your DataFrames excessively large? The fastest option is to perform an enormous merge and then subset the result. But you might be limited by memory. – ALollz Aug 08 '18 at 20:53
  • @ALollz the 1 side is 99 rows long and the many size is 10k long – Ryan Schaefer Aug 08 '18 at 20:54
  • @ALollz is there anyway I can take the date range from the 1 side and expand each date in the range out into its own row? – Ryan Schaefer Aug 08 '18 at 20:54
  • Yes, that is an option too See [Vectorizing a function using date range](https://stackoverflow.com/questions/49481064/python-pandas-trying-to-vectorize-a-function-using-date-range?noredirect=1#comment85987338_49481064). But I can tell you from experience it's still an extremely slow process. For this problem you have to trade off between memory and time, there's no way around it. – ALollz Aug 08 '18 at 20:56
  • @ALollz I have the memory to spare as I'm on a workstation – Ryan Schaefer Aug 08 '18 at 20:58
  • This is the way to do so , https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range/46526249#46526249, hope you close this question , since we have the similar question already – BENY Aug 16 '18 at 01:19
  • @RyanSchaefer I think your issue with column B having only one date value is somewhat tangential to the problem at hand. Basically, for the rows with a single date, you need to set both the start and end date to that same date, and then you can just apply any of these methods – ALollz Aug 21 '18 at 14:32

2 Answers2

13

So here's the option with merging:

Assume you have two DataFrames:

import pandas as pd
df1 = pd.DataFrame({'date': ['2015-01-01', '2015-01-02', '2015-01-03'], 
                    'data': ['A', 'B', 'C']})
df2 = pd.DataFrame({'date': ['2015-01-01 to 2015-01-02', '2015-01-01 to 2015-01-02', '2015-01-02 to 2015-01-03'], 
                    'data': ['E', 'F', 'G']})

Now do some cleaning to get all of the dates you need and make sure they are datetime

df1['date'] = pd.to_datetime(df1.date)

df2[['start', 'end']] = df2['date'].str.split(' to ', expand=True)
df2['start'] = pd.to_datetime(df2.start)
df2['end'] = pd.to_datetime(df2.end)
# No need for this anymore
df2 = df2.drop(columns='date')

Now merge it all together. You'll get 99x10K rows.

df = df1.assign(dummy=1).merge(df2.assign(dummy=1), on='dummy').drop(columns='dummy')

And subset to the dates that fall in between the ranges:

df[(df.date >= df.start) & (df.date <= df.end)]
#        date data_x data_y      start        end
#0 2015-01-01      A      E 2015-01-01 2015-01-02
#1 2015-01-01      A      F 2015-01-01 2015-01-02
#3 2015-01-02      B      E 2015-01-01 2015-01-02
#4 2015-01-02      B      F 2015-01-01 2015-01-02
#5 2015-01-02      B      G 2015-01-02 2015-01-03
#8 2015-01-03      C      G 2015-01-02 2015-01-03

If for instance, some dates in df2 were a single date, since we're using .str.split we will get None for the second date. Then just use .loc to set it appropriately.

df2 = pd.DataFrame({'date': ['2015-01-01 to 2015-01-02', '2015-01-01 to 2015-01-02', '2015-01-02 to 2015-01-03',
                             '2015-01-03'], 
                    'data': ['E', 'F', 'G', 'H']})

df2[['start', 'end']] = df2['date'].str.split(' to ', expand=True)
df2.loc[df2.end.isnull(), 'end'] = df2.loc[df2.end.isnull(), 'start']
#  data      start        end
#0    E 2015-01-01 2015-01-02
#1    F 2015-01-01 2015-01-02
#2    G 2015-01-02 2015-01-03
#3    H 2015-01-03 2015-01-03

Now the rest follows unchanged

ALollz
  • 57,915
  • 7
  • 66
  • 89
3

Let's use this numpy method by @piRSquared:

df1 = pd.DataFrame({'date': ['2015-01-01', '2015-01-02', '2015-01-03'], 
                    'data': ['A', 'B', 'C']})
df2 = pd.DataFrame({'date': ['2015-01-01 to 2015-01-02', '2015-01-01 to 2015-01-02', '2015-01-02 to 2015-01-03'], 
                    'data': ['E', 'F', 'G']})

df2[['start', 'end']] = df2['date'].str.split(' to ', expand=True)
df2['start'] = pd.to_datetime(df2.start)
df2['end'] = pd.to_datetime(df2.end)
df1['date'] = pd.to_datetime(df1['date'])

a = df1['date'].values
bh = df2['end'].values
bl = df2['start'].values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.DataFrame(np.column_stack([df1.values[i], df2.values[j]]),
             columns=df1.columns.append(df2.columns))

Output:

                  date data                      date data                start                  end
0  2015-01-01 00:00:00    A  2015-01-01 to 2015-01-02    E  2015-01-01 00:00:00  2015-01-02 00:00:00
1  2015-01-01 00:00:00    A  2015-01-01 to 2015-01-02    F  2015-01-01 00:00:00  2015-01-02 00:00:00
2  2015-01-02 00:00:00    B  2015-01-01 to 2015-01-02    E  2015-01-01 00:00:00  2015-01-02 00:00:00
3  2015-01-02 00:00:00    B  2015-01-01 to 2015-01-02    F  2015-01-01 00:00:00  2015-01-02 00:00:00
4  2015-01-02 00:00:00    B  2015-01-02 to 2015-01-03    G  2015-01-02 00:00:00  2015-01-03 00:00:00
5  2015-01-03 00:00:00    C  2015-01-02 to 2015-01-03    G  2015-01-02 00:00:00  2015-01-03 00:00:00
Scott Boston
  • 147,308
  • 15
  • 139
  • 187