
I am concatenating several dataframes of mutational data from different samples. I know that there will be duplicates, i.e. several samples will share the same mutations. I want to remove the extra duplicate rows for each mutation and instead include a column listing all samples that carry it. I don't think df.drop_duplicates() can do this in the way that np.unique() can (a quick illustration follows the example below).

Simplified Example:

import pandas as pd
df = pd.DataFrame({"Chromosome": [1, 1, 1, 1, 1],
                   "Position": [100, 220, 300, 100, 220],
                   "Gene": ["CHD1", "BRCA2", "TP53", "CHD1", "BRCA2"],
                   "SAMPLE": ["A1", "A2", "A3", "A4", "A5"]})
df
Output:
   Chromosome  Position   Gene SAMPLE
0           1       100   CHD1     A1
1           1       220  BRCA2     A2
2           1       300   TP53     A3
3           1       100   CHD1     A4
4           1       220  BRCA2     A5
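
A quick illustration of the problem: drop_duplicates on the mutation columns keeps only the first sample per mutation and silently discards the rest (a sketch against the example df above):

df.drop_duplicates(subset=['Chromosome', 'Position', 'Gene'])
Output:
   Chromosome  Position   Gene SAMPLE
0           1       100   CHD1     A1
1           1       220  BRCA2     A2
2           1       300   TP53     A3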

and I would like to end up with this:

df_new 
Output:
   Chromosome  Position   Gene   SAMPLES  Count
0           1       100   CHD1  [A1, A4]      2
1           1       220  BRCA2  [A2, A5]      2
2           1       300   TP53        A3      1

I'm sure there is some easy way to do this that I am missing.

Here is the way I've been doing it in NumPy, using the inverse output from np.unique(return_inverse=True). It works, but it's not very efficient.

import numpy as np

Samples = np.array(master_df['Sample_ID'], dtype=str)
temp_array = np.array(master_df[master_df.columns[0:3]], dtype=str)
# unique mutation rows, plus the index arrays needed to map samples back
temp_unq, ind1, inv1, cnts1 = np.unique(temp_array, return_index=True,
                                        return_inverse=True,
                                        return_counts=True, axis=0)
s1 = [[] for _ in cnts1]
for i in range(temp_unq.shape[0]):
    whr = np.where(inv1 == i)[0]   # rows that share unique mutation i
    s1[i] = list(Samples[whr])     # all sample IDs carrying that mutation
unq_combo = master_df.iloc[ind1].reset_index(drop=True)
unq_combo['Counts'] = pd.Series(cnts1)
unq_combo['Samples#'] = pd.Series(s1)

2 Answers


Use groupby and agg:

df.groupby(['Chromosome', 'Position', 'Gene']).SAMPLE.agg([list, 'count'])
                               list  count
Chromosome Position Gene                  
1          100      CHD1   [A1, A4]      2
           220      BRCA2  [A2, A5]      2
           300      TP53       [A3]      1

(df.groupby(['Chromosome', 'Position', 'Gene']).SAMPLE
   .agg([list, 'count'])
   .reset_index())

   Chromosome  Position   Gene      list  count
0           1       100   CHD1  [A1, A4]      2
1           1       220  BRCA2  [A2, A5]      2
2           1       300   TP53      [A3]      1
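
For the flat format asked about in the comments, a minimal sketch of that rename step (SAMPLES and Count are the labels from the question, not pandas defaults):

df_new = (df.groupby(['Chromosome', 'Position', 'Gene'])['SAMPLE']
            .agg([list, 'count'])
            .reset_index()
            .rename(columns={'list': 'SAMPLES', 'count': 'Count'}))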
  • Thank you! This is like 90% of what I was looking for. One more thing: do you know how to convert this to the same format as my original DataFrame, i.e. a df with Chromosome, Position, Gene, list, and count all as columns? – Danny Rabiz Mar 06 '19 at 20:56
  • @DannyRabiz See the second code sample after the line break? You can simply rename the columns after that. – cs95 Mar 06 '19 at 20:58
  • Much appreciated. – Danny Rabiz Mar 06 '19 at 20:59

I used groupby with an aggregation dict to return the grouped samples in a list (see this SO post).

Create data (per code in OP)

df = pd.DataFrame({"Chromosome": [1, 1, 1, 1, 1],
                   "Position": [100, 220, 300, 100, 220],
                   "Gene": ["CHD1", "BRCA2", "TP53", "CHD1", "BRCA2"],
                   "SAMPLE": ["A1", "A2", "A3", "A4", "A5"]})
print(df)
   Chromosome  Position   Gene SAMPLE
0           1       100   CHD1     A1
1           1       220  BRCA2     A2
2           1       300   TP53     A3
3           1       100   CHD1     A4
4           1       220  BRCA2     A5

Perform groupby using aggregation dict

agg_dict = {'SAMPLE': [list, 'count']}
grouped = (
    df.groupby(['Chromosome', 'Position', 'Gene'], as_index=False)
      .agg(agg_dict)
)
grouped.columns = grouped.columns.map(' '.join).str.strip()
print(grouped)

   Chromosome  Position   Gene SAMPLE list  SAMPLE count
0           1       100   CHD1    [A1, A4]             2
1           1       220  BRCA2    [A2, A5]             2
2           1       300   TP53        [A3]             1
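
Alternatively, named aggregation (available since pandas 0.25) yields flat column names directly and skips the column-flattening step; a minimal sketch, where Samples and Count are illustrative names:

grouped = (
    df.groupby(['Chromosome', 'Position', 'Gene'], as_index=False)
      .agg(Samples=('SAMPLE', list), Count=('SAMPLE', 'count'))
)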

EDIT

Modified per changes to sample data in the OP.
