
I have a DataFrame with the following information:

      Syndromes                                 Genes
0   1p36 microdeletion syndrome     RP11-34P13.7; CICP27; AL627309.1; RP11-34P13.1...
1   Cri du Chat Syndrome            ICMT; LINC00337; HES3; GPR153; ACOT7; RP1-202O...
...

I would like to know whether any genes are involved in more than one syndrome.

Is there an easy approach to do that?

  • I didn't get your question. Please provide a good sample of data with the expected output and explanation. – Pygirl Mar 23 '21 at 11:31

2 Answers

IIUC, you want to find how often each gene occurs across the syndromes.

You can count the syndromes in which a particular gene occurs by first exploding the gene lists and then applying a groupby count.

df1 = df.assign(gene=df['Genes'].str.split('; ')).explode('gene')
df1.groupby('gene')['Syndromes'].count()

gene
ACOT7           1
AL627309.1      1
CICP27          2
GPR153          1
HES3            1
ICMT            1
LINC00337       1
RP1-202O        1
RP11-34P13.1    1
RP11-34P13.7    1
Name:       Syndromes, dtype: int64
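To answer the original question directly, the counts can be filtered down to genes that appear in more than one syndrome. The following is a minimal sketch, not part of the answer above: the two-row sample is hypothetical (CICP27 is deliberately placed in both rows), and nunique is used instead of count so that a gene listed twice for the same syndrome is not counted twice.

```python
import pandas as pd

# Hypothetical sample data; CICP27 is placed in both rows for illustration.
df = pd.DataFrame({
    "Syndromes": ["1p36 microdeletion syndrome", "Cri du Chat Syndrome"],
    "Genes": ["RP11-34P13.7; CICP27; AL627309.1", "ICMT; CICP27; HES3"],
})

# One row per (syndrome, gene) pair.
df1 = df.assign(gene=df["Genes"].str.split("; ")).explode("gene")

# Number of distinct syndromes per gene, then keep genes seen in more than one.
counts = df1.groupby("gene")["Syndromes"].nunique()
shared = counts[counts > 1]
print(shared)
```

With this sample, only CICP27 remains in shared.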

If you want to list the names of the syndromes instead, do:

df1.groupby('gene')['Syndromes'].apply(list)

gene
ACOT7                                  [1   Cri du Chat Syndrome]
AL627309.1                      [0   1p36 microdeletion syndrome]
CICP27          [0   1p36 microdeletion syndrome, 1   Cri du C...
GPR153                                 [1   Cri du Chat Syndrome]
HES3                                   [1   Cri du Chat Syndrome]
ICMT                                   [1   Cri du Chat Syndrome]
LINC00337                              [1   Cri du Chat Syndrome]
RP1-202O                               [1   Cri du Chat Syndrome]
RP11-34P13.1                    [0   1p36 microdeletion syndrome]
RP11-34P13.7                    [0   1p36 microdeletion syndrome]
Name:       Syndromes, dtype: object
Pygirl

From your source data sample I see that you actually have only 2 columns:

  • Syndromes - the syndrome name,
  • Genes - a list of gene names involved in this syndrome (separated by the string '; ').

Assume that your source DataFrame contains:

    Syndromes                                           Genes
0  Syndrome_1    XXX-13.7; XXX-44.2; AL627309.1; RP11-34P13.1
1  Syndrome_2  ICMT; LINC00337; HES3; GPR153; ACOT7; RP1-202O
2  Syndrome_3                        ICMT; XXX-13.7; XXX-22.9
3  Syndrome_4                         XXX-13.7; GPR153; ACOT7
4  Syndrome_5                               XXX-44.2; RP1-202

The first step is to convert each string (containing the gene list) into an actual list:

df.Genes = df.Genes.str.split('; ')

Print the source df before and after this operation to see the difference.
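As a small illustration of that difference, the sketch below inspects one cell before and after the split. It uses a hypothetical two-row subset of the sample data, and the names before and after are introduced only for this example.

```python
import pandas as pd

# Hypothetical two-row subset of the sample data above.
df = pd.DataFrame({
    "Syndromes": ["Syndrome_1", "Syndrome_5"],
    "Genes": ["XXX-13.7; XXX-44.2", "XXX-44.2; RP1-202"],
})

before = df["Genes"].iloc[0]                 # a single string
df["Genes"] = df["Genes"].str.split("; ")    # split on the '; ' separator
after = df["Genes"].iloc[0]                  # a Python list of gene names

print(type(before).__name__, before)
print(type(after).__name__, after)
```

Before the split each cell holds one string; afterwards it holds a list, which is the shape explode needs.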

Then, to see which syndromes each gene is involved in, while filtering out genes that occur in only one syndrome:

  1. Define a function generating a list of syndromes for a group of rows:

    import numpy as np

    def synList(grp):
        # Return the syndromes only for genes that occur in more than one row.
        if grp.index.size > 1:
            return grp.Syndromes.to_list()
        return np.nan
    
  2. Generate the result:

    result = df.explode(column='Genes').groupby('Genes').apply(synList).dropna()
    

    Steps:

    • df.explode(column='Genes') - Convert each row into several rows, one per gene.
    • groupby('Genes') - Group the above result by the Genes column.
    • apply(synList) - Apply the above function to each group.
    • dropna() - Drop genes that occur in only one syndrome (those for which synList returned NaN).

For my source data the result is the following Series:

Genes
ACOT7                   [Syndrome_2, Syndrome_4]
GPR153                  [Syndrome_2, Syndrome_4]
ICMT                    [Syndrome_2, Syndrome_3]
XXX-13.7    [Syndrome_1, Syndrome_3, Syndrome_4]
XXX-44.2                [Syndrome_1, Syndrome_5]
dtype: object

The index (left column) is the gene name and the value (right column) is the list of syndromes in which the gene occurs.
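For reference, here is a self-contained sketch that rebuilds the sample data and produces the same kind of result without a custom function. It is a variation on the approach described here, not the code from this answer: it aggregates the syndromes into lists and then keeps only lists longer than one. The names exploded and per_gene are introduced for illustration.

```python
import pandas as pd

# Sample data as listed in this answer.
df = pd.DataFrame({
    "Syndromes": ["Syndrome_1", "Syndrome_2", "Syndrome_3",
                  "Syndrome_4", "Syndrome_5"],
    "Genes": [
        "XXX-13.7; XXX-44.2; AL627309.1; RP11-34P13.1",
        "ICMT; LINC00337; HES3; GPR153; ACOT7; RP1-202O",
        "ICMT; XXX-13.7; XXX-22.9",
        "XXX-13.7; GPR153; ACOT7",
        "XXX-44.2; RP1-202",
    ],
})

# Split each gene string into a list, then emit one row per gene.
exploded = df.assign(Genes=df["Genes"].str.split("; ")).explode("Genes")

# Collect the syndromes of each gene into a list.
per_gene = exploded.groupby("Genes")["Syndromes"].agg(list)

# Keep only genes that occur in more than one syndrome.
result = per_gene[per_gene.map(len) > 1]
print(result)
```

Filtering on the list length replaces the NaN-then-dropna step, so numpy is not needed in this variant.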

Valdi_Bo
  • Wow it looks like I am reading a manual with proper instructions mentioned. +1 – Pygirl Mar 23 '21 at 12:28
  • The code I wrote is quite concise, but reading it by somebody with small experience in *Python* or *Pandas* can be difficult. So it is advisable to put some explanation concerning each step of such a chained instruction. – Valdi_Bo Mar 23 '21 at 12:34