Looking for the same values in more than one column

Question

I have a data frame with the following information:

      Syndromes                                 Genes
0   1p36 microdeletion syndrome     RP11-34P13.7; CICP27; AL627309.1; RP11-34P13.1...
1   Cri du Chat Syndrome            ICMT; LINC00337; HES3; GPR153; ACOT7; RP1-202O...
...

I would like to know whether any genes are involved in more than one syndrome.

Is there an easy approach to do that?

I didn't get your question. Please provide a good sample of data with the expected output and explanation. — Pygirl, Mar 23 '21 at 11:31

score 1 · Answer 1 · answered Mar 23 '21 at 11:38

IIUC, you want to get the occurrence of genes across the various syndromes.

You can count the syndromes in which a particular gene occurs by first exploding the gene lists and then applying a groupby count.

df1 = df.assign(gene=df['Genes'].str.split('; ')).explode('gene')
df1.groupby('gene')['Syndromes'].count()

gene
ACOT7           1
AL627309.1      1
CICP27          2
GPR153          1
HES3            1
ICMT            1
LINC00337       1
RP1-202O        1
RP11-34P13.1    1
RP11-34P13.7    1
Name:       Syndromes, dtype: int64
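
To answer the original question directly, the counts can be filtered down to the genes that appear in more than one syndrome. A minimal sketch, assuming a small made-up frame (the gene lists below are illustrative, not the asker's real data) and using `nunique` instead of `count` so that a gene repeated within one row is not counted twice:

```python
import pandas as pd

# Hypothetical sample data; only the column names match the question.
df = pd.DataFrame({
    "Syndromes": ["1p36 microdeletion syndrome", "Cri du Chat Syndrome"],
    "Genes": ["RP11-34P13.7; CICP27; AL627309.1", "ICMT; CICP27; HES3"],
})

# One row per (syndrome, gene) pair.
df1 = df.assign(gene=df["Genes"].str.split("; ")).explode("gene")

# Number of distinct syndromes per gene, keeping only genes seen in more than one.
counts = df1.groupby("gene")["Syndromes"].nunique()
shared = counts[counts > 1]
print(shared)
```

Here `shared` is a Series indexed by gene name, so `shared.index.tolist()` gives just the names of the shared genes.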

If you want to list the names of the syndromes, then do:

df1.groupby('gene')['Syndromes'].apply(list)

gene
ACOT7                                  [1   Cri du Chat Syndrome]
AL627309.1                      [0   1p36 microdeletion syndrome]
CICP27          [0   1p36 microdeletion syndrome, 1   Cri du C...
GPR153                                 [1   Cri du Chat Syndrome]
HES3                                   [1   Cri du Chat Syndrome]
ICMT                                   [1   Cri du Chat Syndrome]
LINC00337                              [1   Cri du Chat Syndrome]
RP1-202O                               [1   Cri du Chat Syndrome]
RP11-34P13.1                    [0   1p36 microdeletion syndrome]
RP11-34P13.7                    [0   1p36 microdeletion syndrome]
Name:       Syndromes, dtype: object

Pygirl


@ManoloDominguezBecerra: For using `explode` pandas version should be `>= 0.25` – Pygirl Mar 23 '21 at 11:42
I am doing some upgrading and then I will let you know if this works. – Manolo Dominguez Becerra Mar 23 '21 at 11:46
I have upgraded my pandas and now have version pandas-1.1.3, but I am still getting the error AttributeError: 'DataFrame' object has no attribute 'explode' – Manolo Dominguez Becerra Mar 23 '21 at 11:55
@ManoloDominguezBecerra Do `pip freeze | grep pandas` and check the version. I think you may need to restart your kernel if you are working in a notebook – Pygirl Mar 23 '21 at 11:56
or try `df.assign(gene=df['Genes'].str.split('; ')).to_frame().explode('gene')` – Pygirl Mar 23 '21 at 11:58
@Pygirl Here is one way of doing without using `explode`, `pd.value_counts(np.hstack(df['Genes'].str.split('; ')))` – Shubham Sharma Mar 23 '21 at 12:03
Absolutely right, I restarted my kernel and then it worked. The second suggestion, by Shubham Sharma, worked as well. Thanks!! – Manolo Dominguez Becerra Mar 23 '21 at 12:08
@ShubhamSharma: what about `np.concatenate` (although almost same) :P https://stackoverflow.com/a/38203536/6660373 – Pygirl Mar 23 '21 at 12:14

score 1 · Answer 2 · answered Mar 23 '21 at 12:22

From your source data sample I see that you have actually only 2 columns:

Syndromes - the syndrome name,
Genes - a list of gene names involved in this syndrome (separated with '; ' string).

Assume that your source DataFrame contains:

    Syndromes                                           Genes
0  Syndrome_1    XXX-13.7; XXX-44.2; AL627309.1; RP11-34P13.1
1  Syndrome_2  ICMT; LINC00337; HES3; GPR153; ACOT7; RP1-202O
2  Syndrome_3                        ICMT; XXX-13.7; XXX-22.9
3  Syndrome_4                         XXX-13.7; GPR153; ACOT7
4  Syndrome_5                               XXX-44.2; RP1-202

The first step is to convert each string (containing the gene list) into an actual list:

df.Genes = df.Genes.str.split('; ')

Print the source df before and after this operation to see the difference.
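
A quick way to see that difference is to compare one cell before and after the split. This is only a sketch on a two-row subset of the sample above; the variable names `before` and `after` are mine:

```python
import pandas as pd

# Two rows taken from the sample frame above.
df = pd.DataFrame({
    "Syndromes": ["Syndrome_1", "Syndrome_5"],
    "Genes": ["XXX-13.7; XXX-44.2; AL627309.1; RP11-34P13.1", "XXX-44.2; RP1-202"],
})

before = df["Genes"].iloc[0]               # a single '; '-separated string
df["Genes"] = df["Genes"].str.split("; ")
after = df["Genes"].iloc[0]                # now a list of gene names

print(type(before).__name__, "->", type(after).__name__)
print(after)
```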

Then, to see which syndromes each gene is involved in (filtering out genes involved in only one syndrome):

Define a function generating a list of syndromes for a group of rows:

import numpy as np

def synList(grp):
    # Return the syndromes only for genes that occur in more than one row.
    if grp.index.size > 1:
        return grp.Syndromes.to_list()
    return np.nan

Generate the result:
```
result = df.explode(column='Genes').groupby('Genes').apply(synList).dropna()
```
Steps:
- df.explode(column='Genes') - Convert each row into several rows, each with a single gene.
- groupby('Genes') - Group the above result by the Genes column.
- apply(synList) - Apply the above function to each group.
- dropna() - Drop the genes that occur in only one syndrome.

For my source data the result is the following Series:

Genes
ACOT7                   [Syndrome_2, Syndrome_4]
GPR153                  [Syndrome_2, Syndrome_4]
ICMT                    [Syndrome_2, Syndrome_3]
XXX-13.7    [Syndrome_1, Syndrome_3, Syndrome_4]
XXX-44.2                [Syndrome_1, Syndrome_5]
dtype: object

The index (left column) is the gene name and the value (right column) is the list of syndromes in which the gene occurs.
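
For reference, here is a self-contained sketch of the whole pipeline on the sample data above. It is a variant rather than the exact code from this answer: it assumes that aggregating with `list` and then filtering on list length is an acceptable substitute for the custom function plus `dropna()`. The names `exploded` and `per_gene` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Syndromes": ["Syndrome_1", "Syndrome_2", "Syndrome_3", "Syndrome_4", "Syndrome_5"],
    "Genes": [
        "XXX-13.7; XXX-44.2; AL627309.1; RP11-34P13.1",
        "ICMT; LINC00337; HES3; GPR153; ACOT7; RP1-202O",
        "ICMT; XXX-13.7; XXX-22.9",
        "XXX-13.7; GPR153; ACOT7",
        "XXX-44.2; RP1-202",
    ],
})

# Split each gene string into a list, then expand to one row per gene.
exploded = df.assign(Genes=df["Genes"].str.split("; ")).explode("Genes")

# Collect the syndromes for every gene.
per_gene = exploded.groupby("Genes")["Syndromes"].agg(list)

# Keep only the genes that occur in more than one syndrome.
result = per_gene[per_gene.map(len) > 1]
print(result)
```

Selecting the `Syndromes` column before aggregating keeps the grouping column out of the applied function, which some newer pandas versions warn about when using `groupby(...).apply(...)` on the whole frame.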

Wow it looks like I am reading a manual with proper instructions mentioned. +1 — Pygirl, Mar 23 '21 at 12:28
The code I wrote is quite concise, but it can be difficult to read for somebody with little experience in *Python* or *Pandas*. So it is advisable to add some explanation of each step of such a chained instruction. — Valdi_Bo, Mar 23 '21 at 12:34
