Match standalone words in string in DataFrame column

Question

I have a dataframe such as:

import pandas as pd
import re

df = pd.DataFrame({"Name": ["D1", "D2", "D3", "D4", "M1", "M2", "M3"], 
                  "Requirements": ["3 meters|2/3 meters|3.5 meters",
                                   "3 meters",
                                   "3/5 meters|3 meters",
                                   "2/3 meters",
                                   "steel|g1_steel",
                                   "steel",
                                   "g1_steel"]})

dataframe df

  Name                    Requirements
0   D1  3 meters|2/3 meters|3.5 meters
1   D2                        3 meters
2   D3             3/5 meters|3 meters
3   D4                      2/3 meters
4   M1                  steel|g1_steel
5   M2                           steel
6   M3                        g1_steel

I have a list of words req_list = ['3 meters', 'steel'] and I am trying to extract rows from df where the strings in column Requirements contain standalone words that are from req_list. This is what I have done:

This one prints just D2 and M2

df[df.Requirements.apply(lambda x: any(len(x.replace(y, '')) == 0 for y in req_list))]

This one prints all rows

df[df['Requirements'].str.contains(fr"\b(?:{'|'.join(req_list)})\b")]

My desired result is as follows:

  Name                    Requirements
0   D1  3 meters|2/3 meters|3.5 meters
1   D2                        3 meters
2   D3             3/5 meters|3 meters
4   M1                  steel|g1_steel
5   M2                           steel

In this desired output, D4 and M3 are eliminated because they do not have words from req_list as standalone strings. Is there any way to achieve this preferably in an one-liner without using custom functions?

EDIT

The strings in the column Requirements can come in any pattern such as:

    Name                    Requirements
0   D1    3 meters|2/3 meters|3.5 meters
1   D2                          3 meters
2   D3               3/5 meters|3 meters
3   D4                        2/3 meters
4   D5                       3::3 meters # New pattern which needs to be eliminated
5   D6                        3.3 meters # New pattern which needs to be eliminated
6   D7                        3?3 meters # New pattern which needs to be eliminated
7   M1                    steel|g1_steel
8   M2                             steel
9   M3                          g1_steel

Wiktor Stribiżew · Accepted Answer · 2022-06-16T20:29:54.607

3

Since you want to make sure you do not match 3 meters that is preceded with a digit + /, you may add a (?<!\d/) negative lookbehind after the intial word boundary:

df[df['Requirements'].str.contains(fr"\b(?<!\d/)(?:{'|'.join(req_list)})\b")]

Output:

  Name                    Requirements
0   D1  3 meters|2/3 meters|3.5 meters
1   D2                        3 meters
2   D3             3/5 meters|3 meters
4   M1                  steel|g1_steel
5   M2                           steel

See the regex demo.

Notes

Since req_list contains phrases (mutiword strings) you might have to sort the items by length in the descending order before joining with the | OR operator, so you'd better use fr"\b(?<!\d/)(?:{'|'.join(sorted(req_list, key=len, reverse=True))})\b" as regex
If the req_list ever contains items with special chars you should also use adaptive dynamic word boundaries, i.e. fr"(?!\B\w)(?<!\d/)(?:{'|'.join(sorted(map(re.escape, req_list), key=len, reverse=True))})(?<!\w\B)".

edited Jun 16 '22 at 20:29

answered Jun 16 '22 at 20:24

Wiktor Stribiżew

607,720
39
448
563

1

Good one! Thumbs up!!! – Naveed Jun 16 '22 at 20:57
Thank you so much for this extremely crisp and clear solution! I am so sorry but I forgot to mention that the strings in `Requirements` column can come in any pattern which I have added as an edit in question. – Animeartist Jun 16 '22 at 21:16
@Animeartist Try `\b(?<!\d[^\w\s])(?<!\d[^\w\s][^\w\s])(?:3 meters|steel)\b`, see [the regex demo](https://regex101.com/r/rjscCL/2). – Wiktor Stribiżew Jun 16 '22 at 21:16
It works for most cases but fails for cases like `.3 meters` – Animeartist Jun 16 '22 at 21:26
@Animeartist What do you consider to be the match start? – Wiktor Stribiżew Jun 16 '22 at 21:27
Match start could be anything like `.3 meters`, `d-3 meters` and so on – Animeartist Jun 16 '22 at 21:29
@Animeartist No, I mean the left-hand match boundary. I have had a glance, and now think you want to only start a match 1) at the start of string, 2) after a `|` char. So, try `\b(?<![^\s|])(?:3 meters|steel)\b`. See [this regex demo](https://regex101.com/r/rjscCL/3). – Wiktor Stribiżew Jun 16 '22 at 21:30

score 1 · Answer 2 · answered Jun 16 '22 at 20:54

1

Here is one more way to do it

def chk(row):
    for r in row:
        if r.strip() in req_list:
            return True
    return False

df[df.assign(lst = df['Requirements'].str.split('|'))['lst'].apply(chk) == True]


    Name    Requirements
0   D1      3 meters|2/3 meters|3.5 meters
1   D2      3 meters
2   D3      3/5 meters|3 meters
4   M1      steel|g1_steel
5   M2      steel

answered Jun 16 '22 at 20:54

Naveed

11,495
2
14
21

Thank you so much, this is indeed a very nice solution that covers all usecases but I was hoping for it to be an one-liner since I am refactoring the codebase and can't make too many changes – Animeartist Jun 16 '22 at 21:14

Match standalone words in string in DataFrame column

2 Answers2