I have a dataframe like this:
peakID cytoband start end length 10.388_116 10.193_156 10.401_184 10.214_385
0 Amp_2q37.3_chr2:237990001-242193529 2q37.3 237990001 242193529 4203528 1 0 0 0
1 Del_2q37.3_chr2:226990001-242193529 2q37.3 226990001 242193529 15203528 -1 0 0 0
Notice how peakID
is different, but cytoband
is not. I need to unpivot this table (using a function from pyjanitor
) without keeping peakID
. Currently I do:
import pandas as pd
import pyjanitor
from natsort import natsort_keygen
table = (
table
.drop(columns="peakID")
.pivot_longer(index=["cytoband", "start", "end", "length"],
names_to="sample", values_to="state")
.sort_values(["cytoband", "sample"], key=natsort_keygen())
.remove_columns(["length", "start", "end"])
.set_index("cytoband")
)
And the end result looks like this:
table.loc["2q37.3", :]
Out[36]:
sample state
cytoband
2q37.3 10.193_156 0
2q37.3 10.193_156 0
2q37.3 10.214_385 0
2q37.3 10.214_385 0
2q37.3 10.388_116 1
2q37.3 10.388_116 -1
2q37.3 10.401_184 0
2q37.3 10.401_184 0
The problem lies in the fact that if cytoband
is duplicated in different peakID
s, the resulting table will have the two records (state
) for each sample mixed up (as they don't have the relevant unique ID anymore).
The idea would be to suffix the duplicate records across distinct peakIDs (e.g. "2q37.3_A", "2q37.3_B", but I'm not sure on how to do that with groupby
or pandas in general as I need information from more than one group.
What's the cleanest solution to do this? Existing solutions (or this one) don't really fit.