
I am searching for particular strings in the first column of a big file using str.contains(). Some rows are reported even though they only partially match the provided string. For example:

My file structure:

miRNA,Gene,Species_ID,PCT
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
miR-17-5p/130-5p,AAK1,9606,0.94
miR-17-5p/30-5p,Gnp,9606,0.94 

When I run my search code:

DE_miRNAs = ['31-5p', '150-3p'] #the actual list is much bigger
for miRNA in DE_miRNAs:
    targets = pd.read_csv('my_file.csv')
    new_df = targets.loc[targets['miRNA'].str.contains(miRNA)]

I expect to get only the second row:

miR-17-5p/31-5p,Gnp,9606,0.92

but I get both the first and second rows; 331-5p appears in the result too, which it should not:

miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92

Is there a way to make str.contains() more specific? There is a suggestion here, but how can I implement it in a for loop? str.contains(r"\bmiRNA\b") does not work.

Thank you.

Apex
  • Sorry, do you have to use Pandas here? I have an impression all you need is to read a CSV file and only take the lines where the `DE_miRNAs` match as a whole word. Right? Or do you want to say `Series.str.contains()` instead of `str.contains()`? – Wiktor Stribiżew Mar 04 '22 at 10:23
  • Yes, exactly, but after getting the rows (there can be several rows for a pattern because the Gene column can differ) I need the top 200 based on ascending PCT. – Apex Mar 04 '22 at 10:26
  • Does https://ideone.com/l4fqjL help? – Wiktor Stribiżew Mar 04 '22 at 11:04

2 Answers

Use str.contains with a regex alternation which is surrounded by word boundaries on both sides:

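import re
import pandas as pd

DE_miRNAs = ['31-5p', '150-3p']
# Escape each name so it is matched literally; (?:...) avoids pandas' match-group warning
regex = r'\b(?:' + '|'.join(map(re.escape, DE_miRNAs)) + r')\b'

targets = pd.read_csv('my_file.csv')
new_df = targets.loc[targets['miRNA'].str.contains(regex)]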
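
To keep a separate result per miRNA, as the for loop in the question does, the same word-boundary idea can be applied one name at a time. The following is only a sketch: the sample rows are copied from the question, `top_targets` is a hypothetical helper name, and sorting by ascending PCT with a cut-off of 200 follows the asker's comment rather than anything stated in this answer.

```python
import io
import re

import pandas as pd

# Sample rows from the question; a real run would use pd.read_csv('my_file.csv').
csv_text = """miRNA,Gene,Species_ID,PCT
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
miR-17-5p/130-5p,AAK1,9606,0.94
miR-17-5p/30-5p,Gnp,9606,0.94
"""
targets = pd.read_csv(io.StringIO(csv_text))  # read once, outside the loop

DE_miRNAs = ['31-5p', '150-3p']


def top_targets(df, mirna, n=200):
    """Hypothetical helper: rows whose miRNA column contains `mirna` as a whole word."""
    # Interpolate the variable's value; r"\bmiRNA\b" searches for the literal text "miRNA".
    pattern = rf'\b{re.escape(mirna)}\b'
    hits = df.loc[df['miRNA'].str.contains(pattern, regex=True)]
    # Assumption: "top 200" means the first n rows after sorting PCT in ascending order.
    return hits.sort_values('PCT').head(n)


# One DataFrame per miRNA, keyed by name, so no result is overwritten.
results = {mirna: top_targets(targets, mirna) for mirna in DE_miRNAs}
print(results['31-5p'])
```

Note that `\b` treats `-` and `/` as non-word characters, so `31-5p` is also found in a name such as `miR-31-5p`; whether that is desired depends on the naming scheme in the real file.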
Tim Biegeleisen

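By default, Series.str.contains treats its argument as a regex pattern (regex=True) and reports a match wherever the pattern occurs in the string. You should therefore make the pattern more explicit about where the match may start.

In your case, I suggest searching for /31-5p instead of 31-5p, so that the match must begin right after the slash: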

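DE_miRNAs = ['31-5p', '150-3p']  # the actual list is much bigger
targets = pd.read_csv('my_file.csv')  # read the file once, not on every iteration
for miRNA in DE_miRNAs:
    # the leading "/" stops '31-5p' from matching inside '331-5p'
    new_df = targets.loc[targets['miRNA'].str.contains("/" + miRNA)]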
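
One limitation of the "/" prefix is that it cannot find a name that appears first in the cell (for example 17-5p in miR-17-5p/31-5p). A stricter alternative, sketched below, splits each cell into its members and compares them exactly. It assumes every name has the form "miR-" followed by slash-separated members, which is inferred only from the sample rows in the question; `family_members` is a hypothetical helper name.

```python
import io

import pandas as pd

# Sample rows from the question; a real run would use pd.read_csv('my_file.csv').
csv_text = """miRNA,Gene,Species_ID,PCT
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
miR-17-5p/130-5p,AAK1,9606,0.94
miR-17-5p/30-5p,Gnp,9606,0.94
"""
targets = pd.read_csv(io.StringIO(csv_text))


def family_members(name):
    """Hypothetical helper: 'miR-17-5p/31-5p' -> {'17-5p', '31-5p'}."""
    # Assumption: an optional 'miR-' prefix, then members separated by '/'.
    if name.startswith('miR-'):
        name = name[len('miR-'):]
    return set(name.split('/'))


wanted = {'31-5p', '150-3p'}

# Keep a row when at least one of its members is exactly one of the wanted names.
mask = targets['miRNA'].map(lambda name: not wanted.isdisjoint(family_members(name)))
exact_df = targets.loc[mask]
print(exact_df)
```

Because the comparison is exact set membership, no regex escaping is needed and partial matches such as 331-5p cannot occur.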
TheFaultInOurStars