Combining two pandas dataframes that do not match

Question

i have data of the shifts that are worked and data of the machine, i wish to establish a link between the machine state and the shift that was working at that time.

an example of the data of the machines:

DateTime                State     Comment
2023-01-01 09:01:300    running   
2023-01-01 10:10:230    stopped   Finished 
2023-01-01 22:50:670    running   
2023-01-02 07:58:990    stopped   Finished
2023-01-02 15:06:800    running
2023-01-02 20:23:550    stopped   User aborted 
2023-01-03 07:02:650    running 
2023-01-03 10:07:130    stopped   Finished

An example of the shift data where A,B,C are the shifts and M means morning and N means noon.

            A  B  C  
Date                  
2023-01-01  M  N  -  
2023-01-02  M  N  -  
2023-01-03  N  -  M  
2023-01-04  N  -  M

What i wish to code is that it combines the data in the format of the machine data

DateTime                State     Comment      Shift
2023-01-01 09:01:300    running                A
2023-01-01 10:10:230    stopped   Finished     A
2023-01-01 22:50:670    running                B
2023-01-02 07:58:990    stopped   Finished     A
2023-01-02 15:06:800    running                B
2023-01-02 20:23:550    stopped   User aborted B
2023-01-03 07:02:650    running                C
2023-01-03 10:07:130    stopped   Finished     C

Performance is not important? Added performance comparison to my answer and accpeted solution is 90 times slowier for 1k rows, I think if large data it is slowier. — jezrael, Apr 18 '23 at 09:57
Thank you for including a performance comparison in your answer. While your [solution](https://stackoverflow.com/a/76042230/14234692) is considerably faster, I believe that the difference in milliseconds may not be significant in this real-world scenario. In keeping with best practices for Python programming, prioritizing readability over premature optimization is crucial. Your solution is undoubtedly remarkable but may be difficult to comprehend. However, there is for sure potential for optimization in my proposed `get_shift()` function! @jezrael — psalt, Apr 19 '23 at 09:15
@psalt - Thank you for feedback. `I believe that the difference in milliseconds may not be significant in this real-world scenario` - I think in real scenario is not 10 or less rows, so difference 90 times is relevant. Btw, reason is because for each loop in `apply` test all rows, so it is slow. (`df_shift.loc[timeindex_date].where(df_shift.loc[timeindex_date] == daytime)`) If you create faster solution, let me know, I can add it to timings. — jezrael, Apr 19 '23 at 09:26
@psalt - add performance for 100k rows, it is 487 times slowier. — jezrael, Apr 19 '23 at 09:45
@jezreal its like psalt said, i like his code because it is very understandable. Although you showing me the speed of your code does make me curious. I will see if it works for me or that i have other bottlenecks in which case the speed might be less relevant. Thank you either way. — Joost Lenten, Apr 20 '23 at 11:26

jezrael · Answer 1 · 2023-04-19T09:43:38.440

First convert values to datetimes by to_datetime:

df1['DateTime'] = pd.to_datetime(df1['DateTime'], format='%Y-%m-%d %H:%M:%f')
df2.index = pd.to_datetime(df2.index)

Then reshape df2 by DataFrame.stack and DataFrame.reset_index:

df22 = df2.stack().rename_axis(('dates','Shift')).reset_index(name='value')
print (df22)
        dates Shift value
0  2023-01-01     A     M
1  2023-01-01     B     N
2  2023-01-01     C     -
3  2023-01-02     A     M
4  2023-01-02     B     N
5  2023-01-02     C     -
6  2023-01-03     A     N
7  2023-01-03     B     -
8  2023-01-03     C     M
9  2023-01-04     A     N
10 2023-01-04     B     -
11 2023-01-04     C     M

Create helper column by Series.dt.floor and numpy.where, so possible use left join with df22:

df = (df1.assign(dates = df1['DateTime'].dt.floor('D'),
                value = np.where(df1['DateTime'].dt.hour.le(12), 'M','N'))
         .merge(df22, how='left')
         .drop(['dates','value'], axis=1))

print (df)
                 DateTime    State       Comment Shift
0 2023-01-01 09:01:00.300  running           NaN     A
1 2023-01-01 10:10:00.230  stopped      Finished     A
2 2023-01-01 22:50:00.670  running           NaN     B
3 2023-01-02 07:58:00.990  stopped      Finished     A
4 2023-01-02 15:06:00.800  running           NaN     B
5 2023-01-02 20:23:00.550  stopped  User aborted     B
6 2023-01-03 07:02:00.650  running           NaN     C
7 2023-01-03 10:07:00.130  stopped      Finished     C

If both DatimeIndex:

df1.index = pd.to_datetime(df1.index, format='%Y-%m-%d %H:%M:%f')
df2.index = pd.to_datetime(df2.index)

df22 = df2.stack().rename_axis(('dates','Shift')).reset_index(name='value')

df = (df1.reset_index()
          .assign(dates = df1.index.floor('D'),
                  value = np.where(df1.index.hour < 12, 'M','N'))
         .merge(df22, how='left')
         .drop(['dates','value'], axis=1))
print (df)
                 DateTime    State       Comment Shift
0 2023-01-01 09:01:00.300  running           NaN     A
1 2023-01-01 10:10:00.230  stopped      Finished     A
2 2023-01-01 22:50:00.670  running           NaN     B
3 2023-01-02 07:58:00.990  stopped      Finished     A
4 2023-01-02 15:06:00.800  running           NaN     B
5 2023-01-02 20:23:00.550  stopped  User aborted     B
6 2023-01-03 07:02:00.650  running           NaN     C
7 2023-01-03 10:07:00.130  stopped      Finished     C

Testing performance in 1k rows:

np.random.seed(2023)
N = 1000
df1 = pd.DataFrame({'State': np.random.choice(['running','stopped'], size=N)},
                   index=pd.date_range('2000-01-01', freq='22H', periods=N))
#print (df1)

#https://stackoverflow.com/a/55317373/2901002
def shuffle_along_axis(a, axis):
    idx = np.random.rand(*a.shape).argsort(axis=axis)
    return np.take_along_axis(a,idx,axis=axis)

df2 = pd.DataFrame(shuffle_along_axis(np.tile(['M','N','-'], N).reshape(N, -1), axis=1), 
                 columns=list('ABC'),index=pd.date_range('2000-01-01', freq='D', periods=N))
#print (df2)

def get_shift(timeindex):
    daytime = "M" if timeindex.hour < 12 else "N"
    timeindex_date = timeindex.normalize()
    shifts = df2.loc[timeindex_date].where(df2.loc[timeindex_date] == daytime)
    return shifts.dropna().index.item()


In [133]: %timeit df1["Shift"] = df1.index.map(get_shift)
889 ms ± 5.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [134]: %%timeit
     ...: (df1.reset_index()
     ...:           .assign(dates = df1.index.floor('D'),
     ...:                   value = np.where(df1.index.hour < 12, 'M','N'))
     ...:          .merge(df2.stack().rename_axis(('dates','Shift'))
                             .reset_index(name='value'), how='left')
     ...:          .drop(['dates','value'], axis=1))
     ...:          
9.86 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Testing in 100k rows:

np.random.seed(2023)
N = 100000
df1 = pd.DataFrame({'State': np.random.choice(['running','stopped'], size=N)},
                   index=pd.date_range('1900-01-01', freq='22H', periods=N))
#print (df1)

#https://stackoverflow.com/a/55317373/2901002
def shuffle_along_axis(a, axis):
    idx = np.random.rand(*a.shape).argsort(axis=axis)
    return np.take_along_axis(a,idx,axis=axis)

df2 = pd.DataFrame(shuffle_along_axis(np.tile(['M','N','-'], N).reshape(N, -1), axis=1), 
                 columns=list('ABC'),index=pd.date_range('1900-01-01', freq='D', periods=N))
#print (df2)



def get_shift(timeindex):
    daytime = "M" if timeindex.hour < 12 else "N"
    timeindex_date = timeindex.normalize()
    shifts = df2.loc[timeindex_date].where(df2.loc[timeindex_date] == daytime)
    return shifts.dropna().index.item()

In [2]: %timeit df1["Shift"] = df1.index.map(get_shift)
1min 35s ± 3.85 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %%timeit
   ...:         (df1.reset_index()
   ...:             .assign(dates = df1.index.floor('D'),
   ...:                    value = np.where(df1.index.hour < 12, 'M','N'))
   ...:             .merge(df2.stack().rename_axis(('dates','Shift'))
   ...:             .reset_index(name='value'), how='left')
   ...:             .drop(['dates','value'], axis=1))
   ...:             
195 ms ± 7.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

score 1 · Accepted Answer · answered Apr 18 '23 at 08:37

Here is another possible method that might be simpler and more readable.

First I recreate the two dataframes as good as possible:

import pandas as pd

df_shift = pd.DataFrame(data={"A": ["M", "M", "N", "N"], "B": ["N", "N", "-", "-"], "C": ["-", "-", "M", "M"]},
                        index=pd.date_range("2023-01-01", "2023-01-04"))

df_machine = pd.DataFrame(
    data={"State": ["running", "stopped", "running", "stopped",
                    "running", "stopped", "running", "stopped"],
          "Comment": ["", "Finished", "", "Finished", "", "User aborted", "", "Finished"]},
    index=["2023-01-01 09:01:300", "2023-01-01 10:10:230", "2023-01-01 22:50:670", "2023-01-02 07:58:990",
           "2023-01-02 15:06:800", "2023-01-02 20:23:550", "2023-01-03 07:02:650", "2023-01-03 10:07:130"])

df_machine.index = pd.to_datetime(df_machine.index, format="%Y-%m-%d %H:%M:%f")

To solve your problem, I created a new function get_shift() that looks up the shift in the df_shift table using pandas loc and where() methods. This function is then applied on the index of the machine data using map():

def get_shift(timeindex):
    daytime = "M" if timeindex.hour < 12 else "N"
    timeindex_date = timeindex.normalize()
    shifts = df_shift.loc[timeindex_date].where(df_shift.loc[timeindex_date] == daytime)
    return shifts.dropna().index.item()


df_machine["Shift"] = df_machine.index.map(get_shift)

print(df_machine)
                           State       Comment Shift
2023-01-01 09:01:00.300  running                   A
2023-01-01 10:10:00.230  stopped      Finished     A
2023-01-01 22:50:00.670  running                   B
2023-01-02 07:58:00.990  stopped      Finished     A
2023-01-02 15:06:00.800  running                   B
2023-01-02 20:23:00.550  stopped  User aborted     B
2023-01-03 07:02:00.650  running                   C
2023-01-03 10:07:00.130  stopped      Finished     C

Combining two pandas dataframes that do not match

2 Answers2