
I am very new to Python and pandas. I have a sorted pandas DataFrame with 10k+ rows. Here is a sample data frame:

Example:

             0         1                  2       3        4             5

Hour:12 Min:31 Sec:24 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337      
Hour:12 Min:32 Sec:33 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337      
Hour:12 Min:30 Sec:31 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337      
Hour:12 Min:32 Sec:33 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337         
Hour:12 Min:31 Sec:19 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337      
Hour:12 Min:32 Sec:22 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337     
Hour:12 Min:30 Sec:26 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337     
Hour:12 Min:30 Sec:26 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337          
Hour:12 Min:30 Sec:26 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337          
Hour:12 Min:32 Sec:17 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337           
Hour:12 Min:30 Sec:24 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337            
Hour:12 Min:32 Sec:46 Ms    E_ID:459   Name:I_SECONDROW UE_C:9   M_ID:93  C_ID_1:20337          
Hour:12 Min:30 Sec:24 Ms    E_ID:500   Name:I_SECONDROW UE_C:1   M_ID:80  C_ID_1:20110         
Hour:12 Min:30 Sec:26 Ms    E_ID:500   Name:M_FIRSTROWW UE_C:1   M_ID:80  C_ID_1:20110      

Now I want to combine two rows into a pair where one row has NAME M_FIRSTROWW and the other I_SECONDROW, and both rows have the same data in columns 1, 3, 4 and 5.

A selected pair should also have a time difference of no more than 5 seconds. For example, the last two rows above differ by 2 seconds and share E_ID:500, UE_C:1, M_ID:80 and C_ID_1:20110, so they form a valid pair.

Expected output:

Hour:12 Min:30 Sec:24 Ms    E_ID:500   Name:I_SECONDROW UE_C:1   M_ID:80  C_ID_1:20110         
Hour:12 Min:30 Sec:26 Ms    E_ID:500   Name:M_FIRSTROWW UE_C:1   M_ID:80  C_ID_1:20110


Hour:12 Min:30 Sec:31 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337 
Hour:12 Min:30 Sec:26 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337 


Hour:12 Min:32 Sec:22 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337
Hour:12 Min:32 Sec:17 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337  
  • I answered a similar question. Have a look there: https://stackoverflow.com/questions/44998223/list-of-all-duplicate-columns-in-pandas/44998316#44998316 – cgte Jul 09 '17 at 16:52

1 Answer


1) Some useful imports

import pandas as pd
import numpy as np
import datetime as dt
import itertools
import re

2) Import and clean data

# Read the pipe-separated file and name the six columns
df = pd.read_csv("data.csv", sep="|", header=None, names=["time", "mseid", "name", "uec", "mid", "cid"])
# Pull the hour/minute/second digits out of the time column and parse them as datetimes
df["time"] = [dt.datetime.strptime(":".join(re.findall(r'\d+', time_string)), "%H:%M:%S") for time_string in df["time"]]
# For the remaining columns, keep only the value after the last ":"
df["mseid"] = [mseid.split(":")[-1] for mseid in df["mseid"]]
df["name"] = [name.split(":")[-1] for name in df["name"]]
df["uec"] = [uec.split(":")[-1] for uec in df["uec"]]
df["mid"] = [mid.split(":")[-1] for mid in df["mid"]]
df["cid"] = [cid.split(":")[-1] for cid in df["cid"]]

3) Sort the frame by row name and time, group by row name, and extract the indexes for each group. We can then zip these index lists to pair M_FIRSTROWWs with I_SECONDROWs.

df_sorted = df.sort_values(["name", "time"]).groupby("name").groups.values()
>>> dict_values([Int64Index([10, 12, 6, 7, 8, 4, 0, 9, 11], dtype='int64'), Int64Index([13, 2, 5, 1, 3], dtype='int64')])

# https://stackoverflow.com/questions/12355442/converting-a-list-of-tuples-into-a-simple-flat-list
ordered = list(itertools.chain(*zip(*df_sorted)))
num_groups = int(len(ordered) / 2)
ordered += [ind for ind in df.index if ind not in ordered]
ordered
>>> [10, 13, 12, 2, 6, 5, 7, 1, 8, 3, 0, 4, 9, 11]
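
The zip/chain trick interleaves the two index lists and silently drops the indexes of the longer group that have no partner. A tiny standalone example, using the index values from the output above, shows the effect:

# The two name groups from the output above.
secondrow_idx = [10, 12, 6, 7, 8, 4, 0, 9, 11]   # I_SECONDROW indexes
firstrow_idx = [13, 2, 5, 1, 3]                  # M_FIRSTROWW indexes

# zip stops at the shorter list, so only 5 pairs are formed;
# chain(*...) then flattens the pairs into one interleaved list.
pairs = list(zip(secondrow_idx, firstrow_idx))
print(pairs)        # [(10, 13), (12, 2), (6, 5), (7, 1), (8, 3)]
print(list(itertools.chain(*pairs)))
# [10, 13, 12, 2, 6, 5, 7, 1, 8, 3]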


df = df.iloc[ordered].reset_index(drop=True)
df.head()

>>>     time    mseid   name    uec mid cid
0   1900-01-01 12:30:24 459 I_SECONDROW 10  93  20337
1   1900-01-01 12:30:26 500 M_FIRSTROWW 1   80  20110
2   1900-01-01 12:30:24 500 I_SECONDROW 1   80  20110
3   1900-01-01 12:30:31 459 M_FIRSTROWW 10  93  20337
4   1900-01-01 12:30:26 459 I_SECONDROW 10  93  20337

4) Create and add a column for row pairings

groups = [val for val in range(num_groups) for _ in [0, 1]]
remainder = len(df.index) - len(groups)
groups = groups + ["-" for i in range(remainder)]
df["pair"] = groups
groups

>>> [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, '-', '-', '-', '-']
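
As an aside, the same pair labels can be built with numpy's repeat, which may read a little more clearly; a sketch under the same assumptions (num_groups matched pairs followed by the leftover rows):

# Sketch: pair labels via np.repeat instead of nested comprehensions.
# np.repeat(range(num_groups), 2) yields [0, 0, 1, 1, 2, 2, ...];
# leftover rows keep the "-" placeholder used above.
pair_labels = list(np.repeat(range(num_groups), 2))
pair_labels += ["-"] * (len(df.index) - len(pair_labels))
df["pair"] = pair_labels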

5) Group the row pairs, find the time difference for each pair, and add a time_delta column

pairs = df.groupby("pair")["time"]
time_delta = []
for pair in pairs:
    if len(pair[1]) == 2:
        second, first = pair[1].values
        time_difference = abs(int((first - second)/1000000000)) # nanoseconds to seconds
        time_delta.append(time_difference)
time_delta = [val for val in time_delta for _ in [0, 1]]
remainder = len(df.index) - len(time_delta)
time_delta = time_delta + [np.nan for i in range(remainder)]
df["time_delta"] = time_delta
df

>>>     time    mseid   name    uec mid cid pair    time_delta
0   1900-01-01 12:30:24 459 I_SECONDROW 10  93  20337   0   2.0
1   1900-01-01 12:30:26 500 M_FIRSTROWW 1   80  20110   0   2.0
2   1900-01-01 12:30:24 500 I_SECONDROW 1   80  20110   1   7.0
3   1900-01-01 12:30:31 459 M_FIRSTROWW 10  93  20337   1   7.0
4   1900-01-01 12:30:26 459 I_SECONDROW 10  93  20337   2   116.0
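
The manual division by 1000000000 works because the subtraction yields a numpy timedelta64 in nanoseconds; wrapping the difference in pd.Timedelta avoids the hard-coded factor. A sketch of the same loop, assuming the `pairs` groupby object from step 5:

# Sketch: compute the per-pair gap with the Timedelta API instead of
# dividing raw nanoseconds. Only complete pairs (2 rows) are considered.
time_delta = []
for _, group in pairs:
    if len(group) == 2:
        second, first = group.values
        time_delta.append(abs(pd.Timedelta(first - second).total_seconds()))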

6) Finally, create a boolean mask to keep all rows with time_delta <= 5, then group by "pair"

df[df.time_delta <= 5].groupby("pair").head()

    time    mseid   name    uec mid cid pair    time_delta
0   1900-01-01 12:30:24 459 I_SECONDROW 10  93  20337   0   2.0
1   1900-01-01 12:30:26 500 M_FIRSTROWW 1   80  20110   0   2.0

Note that the parsed timestamps default to the year 1900, which is irrelevant here since we are subtracting times on the same day. However, you should parse the exact timestamp (including the date) when creating the data.
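
If the log rows did include a full date (e.g. a hypothetical "Date:2017-07-09" field), parsing the complete timestamp would look something like the sketch below; with real dates attached, deltas across midnight also come out right.

# Sketch: parsing a full timestamp, assuming a hypothetical raw string
# such as "Date:2017-07-09 Hour:12 Min:30 Sec:24".
raw = "Date:2017-07-09 Hour:12 Min:30 Sec:24"
parts = re.findall(r"[\d-]+", raw)           # ['2017-07-09', '12', '30', '24']
stamp = dt.datetime.strptime(" ".join([parts[0], ":".join(parts[1:])]),
                             "%Y-%m-%d %H:%M:%S")
print(stamp)                                 # 2017-07-09 12:30:24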

daytony
  • Thank you so much Dermot McGrath ..!! – user3484464 Jul 10 '17 at 09:36
  • When making pairs there should be a validation check, i.e. FIRSTROW and SECONDROW should have the same value in mseid, mid, cid. – user3484464 Jul 11 '17 at 05:45
  • i.e. make row pairs of FIRSTROW and SECONDROW such that they have the same value in mseid, mid, cid. – user3484464 Jul 11 '17 at 05:58
  • I see, that wasn't clear originally. I suggest reading up on [.groupby()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) and playing around with it; that should give you what you're looking for. You should also read a little on the [split-apply-combine](https://pandas.pydata.org/pandas-docs/stable/groupby.html) process, as it is important in pandas. – daytony Jul 14 '17 at 07:04
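
As a rough, untested sketch of the groupby-based validation daytony points to in these comments (pairing only rows that share mseid, uec, mid and cid, using the column names from the answer above):

# Sketch: only pair M_FIRSTROWW / I_SECONDROW rows that share
# mseid, uec, mid and cid, and keep pairs at most 5 seconds apart.
# Illustrative only; not tested against the full 10k-row file.
matched_pairs = []
for _, grp in df.groupby(["mseid", "uec", "mid", "cid"]):
    firsts = grp[grp["name"] == "M_FIRSTROWW"].sort_values("time")
    seconds = grp[grp["name"] == "I_SECONDROW"].sort_values("time")
    for f_row, s_row in zip(firsts.itertuples(), seconds.itertuples()):
        if abs((f_row.time - s_row.time).total_seconds()) <= 5:
            matched_pairs.append((s_row.Index, f_row.Index))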