How to explicitly find string using str.contains() in a loop?

Question

I am searching for particular strings in the first column of a big file using str.contains(). Some rows are reported even if they only partially match the provided string. For example:

My file structure:

miRNA,Gene,Species_ID,PCT
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
miR-17-5p/130-5p,AAK1,9606,0.94
miR-17-5p/30-5p,Gnp,9606,0.94

When I run my search code:

DE_miRNAs = ['31-5p', '150-3p'] #the actual list is much bigger
for miRNA in DE_miRNAs:
    targets = pd.read_csv('my_file.csv')
    new_df = targets.loc[targets['miRNA'].str.contains(miRNA)]

I expect to get only the second row:

miR-17-5p/31-5p,Gnp,9606,0.92

but I get both the first and second rows. 331-5p shows up in the result too, which it should not:

miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92

Is there a way to make str.contains() more specific? There is a suggestion here, but how can I implement it in a for loop? str.contains(r"\bmiRNA\b") does not work.

Thank you.

Sorry, do you have to use Pandas here? I have an impression all you need is to read a CSV file and only take the lines where the `DE_miRNAs` match as a whole word. Right? Or do you want to say `Series.str.contains()` instead of `str.contains()`? — Wiktor Stribiżew, Mar 04 '22 at 10:23
Yes exactly but after having the rows ( there can be several rows for a pattern because Gene column can be different) I need to have the top 200 based on ascending PCT column. — Apex, Mar 04 '22 at 10:26

score 0 · Answer 1 · answered Mar 04 '22 at 08:57

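Use str.contains with a regex alternation surrounded by word boundaries on both sides. Escaping each name with re.escape keeps any regex metacharacters literal, and the non-capturing group (?:...) avoids the pandas warning about match groups:

import re
import pandas as pd

DE_miRNAs = ['31-5p', '150-3p']
regex = r'\b(?:' + '|'.join(map(re.escape, DE_miRNAs)) + r')\b'

targets = pd.read_csv('my_file.csv')
new_df = targets.loc[targets['miRNA'].str.contains(regex)]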
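
Since the question asks how to do this inside a for loop, here is a minimal sketch of a per-miRNA version. It assumes the four sample rows from the question stand in for my_file.csv (loaded through io.StringIO so the snippet runs on its own), and the results dict is a hypothetical container that is not part of the original code.

```python
import io
import re

import pandas as pd

# Sample rows copied from the question; stands in for my_file.csv.
csv_text = """miRNA,Gene,Species_ID,PCT
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
miR-17-5p/130-5p,AAK1,9606,0.94
miR-17-5p/30-5p,Gnp,9606,0.94
"""
targets = pd.read_csv(io.StringIO(csv_text))  # read once, outside the loop

DE_miRNAs = ['31-5p', '150-3p']
results = {}  # hypothetical container: one filtered DataFrame per miRNA
for miRNA in DE_miRNAs:
    # Build the pattern from the variable's value; r"\bmiRNA\b" would
    # search for the literal text "miRNA" rather than the name in miRNA.
    pattern = r'\b' + re.escape(miRNA) + r'\b'
    results[miRNA] = targets.loc[targets['miRNA'].str.contains(pattern)]

print(results['31-5p'])
```

Keeping one DataFrame per name also avoids overwriting new_df on every pass through the loop.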

score 0 · Answer 2 · answered Mar 04 '22 at 09:02

Series.str.contains treats its argument as a regex pattern by default, so you should be more explicit about the pattern you are using.

In your case, I suggest you use /31-5p instead of 31-5p:

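import pandas as pd

DE_miRNAs = ['31-5p', '150-3p'] #the actual list is much bigger
targets = pd.read_csv('my_file.csv')  # read the file once, outside the loop
for miRNA in DE_miRNAs:
    # the leading "/" keeps '31-5p' from matching inside '331-5p'
    new_df = targets.loc[targets['miRNA'].str.contains("/" + miRNA)]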
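
One caveat that may be worth checking before relying on the "/" prefix: it only finds names that follow a slash. The sketch below compares it with a word-boundary pattern on the sample rows from the question. The query '17-5p' is a hypothetical example chosen because it follows "miR-" rather than "/", and regex=False is used so the slash version is matched as plain text.

```python
import io
import re

import pandas as pd

# Sample rows copied from the question; stands in for my_file.csv.
csv_text = """miRNA,Gene,Species_ID,PCT
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
miR-17-5p/130-5p,AAK1,9606,0.94
miR-17-5p/30-5p,Gnp,9606,0.94
"""
targets = pd.read_csv(io.StringIO(csv_text))

name = '17-5p'  # hypothetical query: preceded by "miR-", not by "/"

# Literal substring search for "/17-5p".
slash_hits = targets.loc[targets['miRNA'].str.contains('/' + name, regex=False)]

# Word-boundary regex search for the same name.
boundary_pattern = r'\b' + re.escape(name) + r'\b'
boundary_hits = targets.loc[targets['miRNA'].str.contains(boundary_pattern)]

print(len(slash_hits), len(boundary_hits))
```

On these four rows the slash version finds no rows for '17-5p', while the word-boundary version finds all four, so the prefix approach fits only when every name of interest sits after a slash.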
