I am concatenating several dataframes of Mutational Data from different Samples. I know that there will be duplicates, i.e. several samples will be have the same common mutations. I want to remove extra duplicate rows of same mutation, and instead include a columns with all samples with that mutation. I don't think df.drop_duplicates() does this, in the same way that np.unique() can.
Simplified Example:
import pandas as pd
df = pd.DataFrame({"Chromosome":[1, 1, 1, 1, 1],
'Position': [100, 220,300,100,220],
"Gene":["CHD1","BRCA2","TP53","CHD1", "BRCA2"],
"SAMPLE":["A1","A2","A3","A4", "A5"]})
df
Output:
Chromosome Position Gene SAMPLE
0 1 100 CHD1 S1
1 1 220 BRCA2 S2
2 1 300 TP53 S3
3 1 100 CHD1 S4
4 1 220 BRCA2 S5
and I would like to end up with this:
df_new
Output:
Chromosome Position Gene SAMPLES Count
0 1 100 CHD1 [S1, S4] 2
1 1 220 BRCA2 [S2,S5] 2
2 1 300 TP53 S3 1
I'm sure there is some easy way to do this that I am missing.
Here is the way I've been doing in numpy (using inverse output from np.uniue(return_inverse=True). It works, but not so efficient.
Samples = array(master_df['Sample_ID'], dtype=str)
temp_array = array(master_df[master_df.columns[0:3]], dtype=str)
temp_unq, ind1, inv1, cnts1 = unique(temp_array, return_index= True, return_inverse=True, return_counts=True, axis=0)
s1 = [[] for i in cnts1]
for i in range(temp_unq.shape[0]):
whr = np.where(inv1==i)[0]
s1[i].append(list(Samples[whr]))
unq_combo = master_df.iloc[ind1]
unq_combo = unq_combo.reset_index(drop=True)
unq_combo['Counts'] =pd.Series(cnts1)
unq_combo['Samples#'] = pd.Series(s1)