My goal is to group by 'Patient' column and output each patient in a single row =, followed by multiple columns from my input file in sequence. Ideally, I would prefer to have a counter in the column heads. Here is my input file example:
Patient Test panel gene alteration
1 A 54 APC E1345*
1 B 54 TP53 Y205H
1 C 54 APC V2278V
2 A 54 KRAS G12D
2 B 54 PTEN L25L
3 A 54 KRAS G13D
3 C 54 TP53 C141W
3 C 54 APC R876*
3 A 54 ERBB2 L663P
Expected output where it groupby 'Patient' Column, then loops over columns "Test", "gene" and "alteration" to create the following:
Patient Test gene alteration Test gene alteration Test gene alteration Test gene alteration Test gene alteration
1 A APC E1345* B TP53 Y205H C TP53 Y205H
2 A KRAS G12D B PTEN L25L
3 A KRAS G13D C TP53 C141W C APC R876* A ERBB2 L663P A ERBB2 L663P
Ideally, it would be nice to # the Tests/gene/alteration i.e Test_1 gene_1 alteration_1 etc. However, I realize that made things complicated.
here is what I tried and I am unable to get the output of interest
df = pd.read_table(args.md, sep="\t")
df=pd.DataFrame(df) #I used an input file
values=grouped['gene'].apply('\t'.join).reset_index()
the output for this function 1) did not let me combine more than 'gene' column so if I use ['gene', 'Test'] it wouldn't give the desired output and 2) the joining '\t' would literally be output as '\t' instead of a tab
So then I tried
grouped=df.groupby('Patient')
print grouped
values=grouped['gene'].apply('\t'.join).reset_index()
print values
id_df = grouped['Test'].apply(lambda x: pd.Series(x.values)).unstack()
id_df = id_df.rename(columns={i: 'Test{}'.format(i + 1) for i in range(id_df.shape[1])})
result = pd.concat([id_df, values], axis=1)
print(result)
The structure of this second attempt did not conform what with I needed but it did provide me with a counter
I was wondering if anyone can provide some insight to get the desired output. I used the above commands from looking on SO but was not able to troubleshoot.