1

I try to combine many pairs of rows when run the code one time. As my example shows, for two rows which can be combined, the rules are,

  • values in PT, DS, SC columns must be same.
  • time stamps in FS must be the closest pair.
  • combine on ID column (string) is like ID1,ID2.
  • combine on WT and CB column (number) is sum().
  • combine on FS is as the latest time.

My example is,

df0 = pd.DataFrame({'ID':['1001','1002','1003','1004','2001','2002','2003','2004','3001','3002','3003','3004','4001','4002','4003','4004','5001','5002','5003','5004','6001'],
                   'PT':['B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','D','D','D','D','F'],
                   'DS':['AAA','AAA','AAA','AAA','AAA','AAA','AAA','AAA','AAB','AAB','AAB','AAB','AAB','AAB','AAB','AAB','AAA','AAA','AAA','AAB','AAB'],
                   'SC':['P1','P1','P1','P1','P2','P2','P2','P2','P1','P1','P1','P1','P2','P2','P2','P2','P1','P1','P1','P2','P2'],
                   'FS':['2020-10-16 00:00:00','2020-10-16 00:00:02','2020-10-16 00:00:03','2020-10-16 00:00:04','2020-10-16 00:00:00','2020-10-16 00:00:01','2020-10-16 00:00:02','2020-10-16 00:00:03','2020-10-16 00:00:00','2020-10-16 00:00:01','2020-10-16 00:00:05','2020-10-16 00:00:07','2020-10-16 00:00:01','2020-10-16 00:00:10','2020-10-16 00:10:00','2020-10-16 00:10:40','2020-10-16 00:00:00','2020-10-16 00:10:00','2020-10-16 00:00:40','2020-10-16 00:00:10','2020-10-16 00:00:05'],
                   'WT':[1,2,3,4,10,11,12,13,20,21,22,23,30,31,32,33,40,41,42,43,53],
                   'CB':[0.1,0.2,0.3,0.4,1,1.1,1.2,1.3,2,2.1,2.2,2.3,3,3.1,3.2,3.3,4,4.1,4.2,4.3,5.3]})

When run the code one time, the new dataframe df1 is,

df1 = pd.DataFrame({'ID':['1001,1002','1003,1004','2001,2002','2003,2004','3001,3002','3003,3004','4001,4002','4003,4004','5001,5002','5003','5004','6001'],
                   'PT':['B','B','B','B','B','B','B','B','D','D','D','F'],
                   'DS':['AAA','AAA','AAA','AAA','AAB','AAB','AAB','AAB','AAA','AAA','AAB','AAB'],
                   'SC':['P1','P1','P2','P2','P1','P1','P2','P2','P1','P1','P2','P2'],
                   'FS':['2020-10-16 00:00:02','2020-10-16 00:00:04','2020-10-16 00:00:01','2020-10-16 00:00:03','2020-10-16 00:00:01','2020-10-16 00:00:07','2020-10-16 00:00:10','2020-10-16 00:10:40','2020-10-16 00:10:00','2020-10-16 00:00:40','2020-10-16 00:00:10','2020-10-16 00:00:05'],
                   'WT':[3,7,21,25,41,45,61,65,81,42,43,53],
                   'CB':[0.3,0.7,2.1,2.5,4.1,4.5,6.1,6.5,8.1,4.2,4.3,5.3]})

When run the code again on df1, the new dataframe df2 is,

df2 = pd.DataFrame({'ID':['1001,1002,1003,1004','2001,2002,2003,2004','3001,3002,3003,3004','4001,4002,4003,4004','5001,5002,5003','5004','6001'],
                   'PT':['B','B','B','B','D','D','F'],
                   'DS':['AAA','AAA','AAB','AAB','AAA','AAB','AAB'],
                   'SC':['P1','P2','P1','P2','P1','P2','P2'],
                   'FS':['2020-10-16 00:00:04','2020-10-16 00:00:03','2020-10-16 00:00:07','2020-10-16 00:10:40','2020-10-16 00:10:00','2020-10-16 00:00:10','2020-10-16 00:00:05'],
                   'WT':[10,46,86,126,123,43,53],
                   'CB':[1,4.6,8.6,12.6,12.3,4.3,5.3]})

Here no more combines can be done on df2 because no any pair of rows meets the rules.

The reason is that I have memory limit and have to decrease the size of data without losing the info. So I try to bundle IDs which shares same features and happens close to each other. I plan to run the code multiple times until no more memory issue or no more possible combines.

franky
  • 45
  • 4

2 Answers2

0

This is a good place to use GroupBy operations.

My source was Wes McKinney's Python for Data Analysis.

df0['ID'] = df0.groupby([df0['PT'], df0['DS'], df0['SC']])['ID'].transform(lambda x: ','.join(x))

max_times = df0.groupby(['ID', 'PT', 'DS', 'SC'], as_index = False).max().drop(['WT', 'CB'], axis = 1)

sums_WT_CB = df0.groupby(['ID', 'PT', 'DS', 'SC'], as_index = False).sum()

df2 = pd.merge(max_times, sums_WT_CB, on=['ID', 'PT', 'DS', 'SC'])

This code just takes the most recent time for each unique grouping of the columns you specified. If there are other requirements for the FS column, you will have to modify this.

Code to concatenate the IDs came from: Concatenate strings from several rows using Pandas groupby

tr4nshum4n
  • 11
  • 2
  • Thank you for your answer. Your code works to get df2. However, I wish to run the code first time to get df1, and then run the code second time on df1 to get df2. For every time, I only want to combine two rows meeting the rules. Could you please add this part into your answer? – franky Oct 19 '20 at 14:36
-1

Perhaps there's something more straightforward (please comment if so :)

but the following seems to work:

def combine(data):
    return pd.DataFrame(
        {
            "ID": ",".join(map(str, data["ID"])),
            "PT": data["PT"].iloc[0],
            "DS": data["DS"].iloc[0],
            "SC": data["SC"].iloc[0],
            "WT": data["WT"].sum(),
            "CB": data["CB"].sum(),
            "FS": data["FS"].max(),
        },
        index=[0],
    ).reset_index(drop=True)

df_agg = (
    df.sort_values(["PT", "DS", "SC", "FS"])
    .groupby(["PT", "DS", "SC"])
    .apply(combine)
    .reset_index(drop=True)
)

returns

                    ID PT   DS  SC   WT    CB                   FS
0  1001,1002,1003,1004  B  AAA  P1   10   1.0  2020-10-16 00:00:04
1  2001,2002,2003,2004  B  AAA  P2   46   4.6  2020-10-16 00:00:03
2  3001,3002,3003,3004  B  AAB  P1   86   8.6  2020-10-16 00:00:07
3  4001,4002,4003,4004  B  AAB  P2  126  12.6  2020-10-16 00:10:40
4       5001,5003,5002  D  AAA  P1  123  12.3  2020-10-16 00:10:00
5                 5004  D  AAB  P2   43   4.3  2020-10-16 00:00:10
6                 6001  F  AAB  P2   53   5.3  2020-10-16 00:00:05
baxx
  • 3,956
  • 6
  • 37
  • 75
  • Thank you for your answer. Your code works after changing df to df0 to get df2. However, I wish to run code first time to get df1, and then run code second time on df1 to get df2. Could you please add that part into your answer? – franky Oct 19 '20 at 14:41
  • @franky sorry - that wasn't clear from the post, i'll see if i can have a look – baxx Oct 21 '20 at 21:51
  • it would be helpful if you made explicit the logic from df0 -> df1, and from df1 -> df2 – baxx Oct 21 '20 at 21:52
  • My plan is to sort the dataframe first as, df.sort_values(['PT', 'DS', 'SC', 'FS'], ascending=[True, True, True, True], inplace=True) Then combine every two rows so that PT, DS and SC are same and FS are the closest. But the problem is ID=5003 in df1 will be combined with ID=5003 in df1 which is not what I want. I want to get df1 when run the code once, and then get df2 when run the code on df1. – franky Oct 25 '20 at 20:32
  • Sorry for the wrong typing, ID=5003 and ID=5004 in df1 should not be combined. – franky Oct 25 '20 at 20:39
  • I think I need something like df.groupby(np.arange(len(df))//2) – franky Oct 25 '20 at 20:42