
I have a dataframe df:

df:

        chr          gene_name
0        1           ARF3
1        1           ABC
2        1           ARF3, ENSG123
3        1           ENSG,ARF3
4        1           ANG
5        2           XVY
6        2           PQR
7        3           RST
8        4           TAC 

and a gene_list

gene_list = ['ARF3','ABC' ]

Now I need to get the rows from the data frame (df) for which gene_name either exactly matches, or contains, an element of gene_list.

So, I tried:

df2 = df[df.gene_name.isin(gene_list)]

I retrieved:

       chr           gene_name
0        1           ARF3
1        1           ABC

but what I am expecting is:

        chr          gene_name
0        1           ARF3
1        1           ABC
2        1           ARF3, ENSG123
3        1           ENSG,ARF3

So basically I want all the rows in the data frame where an element of gene_list is a substring of gene_name.

I thought of using .contains(), but as far as I can tell that would only help if I were looking the other way around, that is, if gene_name in the data frame were a substring of an element in gene_list.

Any help is appreciated.

Engineero
user6475383

1 Answer


You can use str.contains with all the values joined by | (regex "or"):

gene_list = ['ARF3','ABC' ]

print ('|'.join(gene_list))
ARF3|ABC

print (df.gene_name.str.contains('|'.join(gene_list)))
0     True
1     True
2     True
3     True
4    False
5    False
6    False
7    False
8    False
Name: gene_name, dtype: bool

df2 = df[df.gene_name.str.contains('|'.join(gene_list))]
print (df2)
   chr     gene_name
0    1          ARF3
1    1           ABC
2    1  ARF3,ENSG123
3    1     ENSG,ARF3
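
A hedged variation on the same idea, in case the gene names may contain regex metacharacters or the column may have missing values. This is only a sketch: the sample frame is rebuilt from the question, and the names pattern, mask, exact_mask and df3 are illustrative. It assumes gene names within a cell are separated by commas.

```python
import re
import pandas as pd

# Sample data reconstructed from the question (comma-separated gene names).
df = pd.DataFrame({
    "chr": [1, 1, 1, 1, 1, 2, 2, 3, 4],
    "gene_name": ["ARF3", "ABC", "ARF3,ENSG123", "ENSG,ARF3",
                  "ANG", "XVY", "PQR", "RST", "TAC"],
})
gene_list = ["ARF3", "ABC"]

# Escape each name so characters such as '.' or '+' are matched literally.
pattern = "|".join(re.escape(g) for g in gene_list)

# na=False treats missing gene names as non-matches instead of NaN.
mask = df["gene_name"].str.contains(pattern, na=False)
df2 = df[mask]

# Stricter alternative: split on commas and require a whole-name match,
# so 'ARF3' would not match a hypothetical longer name such as 'ARF35'.
tokens = df["gene_name"].str.split(",")
exact_mask = tokens.apply(
    lambda names: any(n.strip() in gene_list for n in names)
)
df3 = df[exact_mask]
```

On this sample both df2 and df3 select rows 0 to 3. The two approaches would differ only when one gene name is a prefix or substring of another.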
jezrael