
I am very new to Python and pandas. I have a sorted pandas DataFrame with 10k+ rows. Here is a sample data frame:

Example:

             0         1                  2       3        4             5

Hour:12 Min:31 Sec:24 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337      
Hour:12 Min:32 Sec:33 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337      
Hour:12 Min:30 Sec:31 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337      
Hour:12 Min:32 Sec:33 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337         
Hour:12 Min:31 Sec:19 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337      
Hour:12 Min:32 Sec:22 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337     
Hour:12 Min:30 Sec:26 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337     
Hour:12 Min:30 Sec:26 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337          
Hour:12 Min:30 Sec:26 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337          
Hour:12 Min:32 Sec:17 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337           
Hour:12 Min:30 Sec:24 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337            
Hour:12 Min:32 Sec:46 Ms    E_ID:459   Name:I_SECONDROW UE_C:9   M_ID:93  C_ID_1:20337          
Hour:12 Min:30 Sec:24 Ms    E_ID:500   Name:I_SECONDROW UE_C:1   M_ID:80  C_ID_1:20110         
Hour:12 Min:30 Sec:26 Ms    E_ID:500   Name:M_FIRSTROWW UE_C:1   M_ID:80  C_ID_1:20110      

Now I want to combine two rows into a pair where one row has NAME M_FIRSTROWW and the other I_SECONDROW, and both rows have the same data in columns 1, 3, 4 and 5.

A selected pair should also have a time difference of no more than 5 seconds. For example, the last two rows above differ by 2 seconds and share E_ID:500, UE_C:1, M_ID:80 and C_ID_1:20110, so they form a valid pair.

Expected output:

Hour:12 Min:30 Sec:24 Ms    E_ID:500   Name:I_SECONDROW UE_C:1   M_ID:80  C_ID_1:20110         
Hour:12 Min:30 Sec:26 Ms    E_ID:500   Name:M_FIRSTROWW UE_C:1   M_ID:80  C_ID_1:20110


Hour:12 Min:30 Sec:31 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337 
Hour:12 Min:30 Sec:26 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337 


Hour:12 Min:32 Sec:22 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337
Hour:12 Min:32 Sec:17 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337  
  • I answered a similar question. Have a look there: https://stackoverflow.com/questions/44998223/list-of-all-duplicate-columns-in-pandas/44998316#44998316 – cgte Jul 09 '17 at 16:52

1 Answer


1) Some useful imports

import pandas as pd
import numpy as np
import datetime as dt
import itertools
import re

2) Import and clean data

# Read the pipe-separated file and name the six columns
df = pd.read_csv("data.csv", sep="|", header=None, names=["time", "mseid", "name", "uec", "mid", "cid"])
# Pull the hour/minute/second digits out of the time column and parse them as datetimes
df["time"] = [dt.datetime.strptime(":".join(re.findall(r'\d+', time_string)), "%H:%M:%S") for time_string in df["time"]]
# For the remaining columns, keep only the value after the last ":"
df["mseid"] = [mseid.split(":")[-1] for mseid in df["mseid"]]
df["name"] = [name.split(":")[-1] for name in df["name"]]
df["uec"] = [uec.split(":")[-1] for uec in df["uec"]]
df["mid"] = [mid.split(":")[-1] for mid in df["mid"]]
df["cid"] = [cid.split(":")[-1] for cid in df["cid"]]

3) Sort the frame by row name and time, group by row name, and extract the indexes for each group. We can then zip these index lists to pair M_FIRSTROWWs with I_SECONDROWs.

df_sorted = df.sort_values(["name", "time"]).groupby("name").groups.values()
>>> dict_values([Int64Index([10, 12, 6, 7, 8, 4, 0, 9, 11], dtype='int64'), Int64Index([13, 2, 5, 1, 3], dtype='int64')])

# https://stackoverflow.com/questions/12355442/converting-a-list-of-tuples-into-a-simple-flat-list
ordered = list(itertools.chain(*zip(*df_sorted)))
num_groups = int(len(ordered) / 2)
ordered += [ind for ind in df.index if ind not in ordered]
ordered
>>> [10, 13, 12, 2, 6, 5, 7, 1, 8, 3, 0, 4, 9, 11]
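
The zip/chain trick interleaves the two index lists and silently drops the indexes of the longer group that have no partner. A tiny standalone example, using the index values from the output above, shows the effect:

# The two name groups from the output above.
secondrow_idx = [10, 12, 6, 7, 8, 4, 0, 9, 11]   # I_SECONDROW indexes
firstrow_idx = [13, 2, 5, 1, 3]                  # M_FIRSTROWW indexes

# zip stops at the shorter list, so only 5 pairs are formed;
# chain(*...) then flattens the pairs into one interleaved list.
pairs = list(zip(secondrow_idx, firstrow_idx))
print(pairs)        # [(10, 13), (12, 2), (6, 5), (7, 1), (8, 3)]
print(list(itertools.chain(*pairs)))
# [10, 13, 12, 2, 6, 5, 7, 1, 8, 3]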


df = df.iloc[ordered].reset_index(drop=True)
df.head()

>>>     time    mseid   name    uec mid cid
0   1900-01-01 12:30:24 459 I_SECONDROW 10  93  20337
1   1900-01-01 12:30:26 500 M_FIRSTROWW 1   80  20110
2   1900-01-01 12:30:24 500 I_SECONDROW 1   80  20110
3   1900-01-01 12:30:31 459 M_FIRSTROWW 10  93  20337
4   1900-01-01 12:30:26 459 I_SECONDROW 10  93  20337

4) Create and add a column for row pairings

groups = [val for val in range(num_groups) for _ in [0, 1]]
remainder = len(df.index) - len(groups)
groups = groups + ["-" for i in range(remainder)]
df["pair"] = groups
groups

>>> [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, '-', '-', '-', '-']
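
As an aside, the same pair labels can be built with numpy's repeat, which may read a little more clearly; a sketch under the same assumptions (num_groups matched pairs followed by the leftover rows):

# Sketch: pair labels via np.repeat instead of nested comprehensions.
# np.repeat(range(num_groups), 2) yields [0, 0, 1, 1, 2, 2, ...];
# leftover rows keep the "-" placeholder used above.
pair_labels = list(np.repeat(range(num_groups), 2))
pair_labels += ["-"] * (len(df.index) - len(pair_labels))
df["pair"] = pair_labels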

5) Group the row pairs, find the time difference for each pair, and add a time_delta column

pairs = df.groupby("pair")["time"]
time_delta = []
for pair in pairs:
    if len(pair[1]) == 2:
        second, first = pair[1].values
        time_difference = abs(int((first - second)/1000000000)) # nanoseconds to seconds
        time_delta.append(time_difference)
time_delta = [val for val in time_delta for _ in [0, 1]]
remainder = len(df.index) - len(time_delta)
time_delta = time_delta + [np.nan for i in range(remainder)]
df["time_delta"] = time_delta
df

>>>     time    mseid   name    uec mid cid pair    time_delta
0   1900-01-01 12:30:24 459 I_SECONDROW 10  93  20337   0   2.0
1   1900-01-01 12:30:26 500 M_FIRSTROWW 1   80  20110   0   2.0
2   1900-01-01 12:30:24 500 I_SECONDROW 1   80  20110   1   7.0
3   1900-01-01 12:30:31 459 M_FIRSTROWW 10  93  20337   1   7.0
4   1900-01-01 12:30:26 459 I_SECONDROW 10  93  20337   2   116.0
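
The manual division by 1000000000 works because the subtraction yields a numpy timedelta64 in nanoseconds; wrapping the difference in pd.Timedelta avoids the hard-coded factor. A sketch of the same loop, assuming the `pairs` groupby object from step 5:

# Sketch: compute the per-pair gap with the Timedelta API instead of
# dividing raw nanoseconds. Only complete pairs (2 rows) are considered.
time_delta = []
for _, group in pairs:
    if len(group) == 2:
        second, first = group.values
        time_delta.append(abs(pd.Timedelta(first - second).total_seconds()))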

6) Finally, create a boolean mask to keep all rows with time_delta <= 5, then group by "pair"

df[df.time_delta <= 5].groupby("pair").head()

    time    mseid   name    uec mid cid pair    time_delta
0   1900-01-01 12:30:24 459 I_SECONDROW 10  93  20337   0   2.0
1   1900-01-01 12:30:26 500 M_FIRSTROWW 1   80  20110   0   2.0

Note that the parsed timestamps default to the year 1900, which is irrelevant here since we are subtracting times on the same day. However, you should parse the exact timestamp (including the date) when creating the data.
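
If the log rows did include a full date (e.g. a hypothetical "Date:2017-07-09" field), parsing the complete timestamp would look something like the sketch below; with real dates attached, deltas across midnight also come out right.

# Sketch: parsing a full timestamp, assuming a hypothetical raw string
# such as "Date:2017-07-09 Hour:12 Min:30 Sec:24".
raw = "Date:2017-07-09 Hour:12 Min:30 Sec:24"
parts = re.findall(r"[\d-]+", raw)           # ['2017-07-09', '12', '30', '24']
stamp = dt.datetime.strptime(" ".join([parts[0], ":".join(parts[1:])]),
                             "%Y-%m-%d %H:%M:%S")
print(stamp)                                 # 2017-07-09 12:30:24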

daytony
  • Thank you so much Dermot McGrath ..!! – user3484464 Jul 10 '17 at 09:36
  • When making pairs there should be a validation check, i.e. FIRSTROW and SECONDROW should have the same value in mseid, mid, cid. – user3484464 Jul 11 '17 at 05:45
  • i.e. make row pairs of FIRSTROW and SECONDROW such that they have the same value in mseid, mid, cid. – user3484464 Jul 11 '17 at 05:58
  • I see, that wasn't clear originally. I suggest reading up on [.groupby()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) and playing around with it; that should give you what you're looking for. You should also read a little on the [split-apply-combine](https://pandas.pydata.org/pandas-docs/stable/groupby.html) process, as it is important in pandas. – daytony Jul 14 '17 at 07:04
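
As a rough, untested sketch of the groupby-based validation daytony points to in these comments (pairing only rows that share mseid, uec, mid and cid, using the column names from the answer above):

# Sketch: only pair M_FIRSTROWW / I_SECONDROW rows that share
# mseid, uec, mid and cid, and keep pairs at most 5 seconds apart.
# Illustrative only; not tested against the full 10k-row file.
matched_pairs = []
for _, grp in df.groupby(["mseid", "uec", "mid", "cid"]):
    firsts = grp[grp["name"] == "M_FIRSTROWW"].sort_values("time")
    seconds = grp[grp["name"] == "I_SECONDROW"].sort_values("time")
    for f_row, s_row in zip(firsts.itertuples(), seconds.itertuples()):
        if abs((f_row.time - s_row.time).total_seconds()) <= 5:
            matched_pairs.append((s_row.Index, f_row.Index))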