Your source data sample actually has only 2 columns:
- Syndromes - the syndrome name,
- Genes - a list of gene names involved in this syndrome (separated
by the '; ' string).
Assume that your source DataFrame contains:
   Syndromes   Genes
0  Syndrome_1  XXX-13.7; XXX-44.2; AL627309.1; RP11-34P13.1
1  Syndrome_2  ICMT; LINC00337; HES3; GPR153; ACOT7; RP1-202O
2  Syndrome_3  ICMT; XXX-13.7; XXX-22.9
3  Syndrome_4  XXX-13.7; GPR153; ACOT7
4  Syndrome_5  XXX-44.2; RP1-202
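To follow along, the sample above can be rebuilt as a DataFrame. This is only a sketch of how such a frame might be constructed; the variable name df matches the rest of the answer, and the values are copied from the listing as shown.

```python
import pandas as pd

# Sample data copied from the listing above; each gene list is a single
# string with the gene names separated by '; '.
df = pd.DataFrame({
    'Syndromes': ['Syndrome_1', 'Syndrome_2', 'Syndrome_3',
                  'Syndrome_4', 'Syndrome_5'],
    'Genes': [
        'XXX-13.7; XXX-44.2; AL627309.1; RP11-34P13.1',
        'ICMT; LINC00337; HES3; GPR153; ACOT7; RP1-202O',
        'ICMT; XXX-13.7; XXX-22.9',
        'XXX-13.7; GPR153; ACOT7',
        'XXX-44.2; RP1-202',
    ],
})
print(df)
```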
The first step is to convert each string (containing the gene list)
into an actual list:
df.Genes = df.Genes.str.split('; ')
Print the source df before and after this operation to see the difference.
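A small illustration of what the split changes, using two rows of the sample data (the names before and after are just for this sketch): each cell goes from one string to a Python list of gene names.

```python
import pandas as pd

df = pd.DataFrame({
    'Syndromes': ['Syndrome_4', 'Syndrome_5'],
    'Genes': ['XXX-13.7; GPR153; ACOT7', 'XXX-44.2; RP1-202'],
})

before = df['Genes'].iloc[0]               # a single string
df['Genes'] = df['Genes'].str.split('; ')  # split on the '; ' separator
after = df['Genes'].iloc[0]                # a list of gene names

print(repr(before))
print(repr(after))
```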
Then, to see which syndromes each gene is involved in, leaving out genes
that occur in only one syndrome, define a function generating a list of
syndromes for a group of rows:
import numpy as np

def synList(grp):
    # Keep the syndrome list only when the gene occurs in more than one row
    if grp.index.size > 1:
        return grp.Syndromes.to_list()
    return np.nan
Generate the result:
result = df.explode(column='Genes').groupby('Genes').apply(synList).dropna()
Steps:
- df.explode(column='Genes') - convert each row into several rows,
  one per gene.
- groupby('Genes') - group the above result by the Genes column.
- apply(synList) - apply the above function to each group.
- dropna() - drop the genes that occur in only one syndrome (the function
  returned NaN for them).
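To inspect the intermediate results of these steps, something like the following could be run on two rows of the sample data (with Genes already split into lists). The names exploded and sizes are only for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'Syndromes': ['Syndrome_3', 'Syndrome_4'],
    'Genes': [['ICMT', 'XXX-13.7', 'XXX-22.9'],
              ['XXX-13.7', 'GPR153', 'ACOT7']],
})

# One row per (syndrome, gene) pair; the original index is repeated.
exploded = df.explode(column='Genes')
print(exploded)

# Number of rows (i.e. syndromes) in each gene group.
sizes = exploded.groupby('Genes').size()
print(sizes)
```

Only XXX-13.7 appears in both rows here, so it is the only group with more than one row.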
For my source data the result is the following Series:
Genes
ACOT7       [Syndrome_2, Syndrome_4]
GPR153      [Syndrome_2, Syndrome_4]
ICMT        [Syndrome_2, Syndrome_3]
XXX-13.7    [Syndrome_1, Syndrome_3, Syndrome_4]
XXX-44.2    [Syndrome_1, Syndrome_5]
dtype: object
The index (left column) is the gene name and the value (right column)
is the list of syndromes in which the gene occurs.
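As a possible variation (not the approach used above), the same Series can probably be obtained without a custom function, by aggregating the syndromes of each gene into a list and then keeping only the lists longer than one element. The names by_gene and shared are assumptions made for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'Syndromes': ['Syndrome_1', 'Syndrome_2', 'Syndrome_3',
                  'Syndrome_4', 'Syndrome_5'],
    'Genes': [
        'XXX-13.7; XXX-44.2; AL627309.1; RP11-34P13.1',
        'ICMT; LINC00337; HES3; GPR153; ACOT7; RP1-202O',
        'ICMT; XXX-13.7; XXX-22.9',
        'XXX-13.7; GPR153; ACOT7',
        'XXX-44.2; RP1-202',
    ],
})
df['Genes'] = df['Genes'].str.split('; ')

# Collect, for each gene, the list of syndromes it appears in.
by_gene = df.explode('Genes').groupby('Genes')['Syndromes'].agg(list)

# Keep only genes involved in more than one syndrome.
shared = by_gene[by_gene.map(len) > 1]
print(shared)
```

A single gene can then be looked up by name, e.g. shared['XXX-13.7'].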