Pandas extract information which starts with [\s\d_/] and ends in [\s\d_/]

Question

I am trying to extract a set of keywords such as ['lemon', 'apple', 'coconut'] from paths such as "/var/prj/lemon_123/xyz", "/var/prj/123_apple/coconut", "/var/prj/lemonade/coconutapple", "/var/prj/apple/lemon".

The expected output is a little complex:

Paths	MatchedKeywords
"/var/prj/lemon_123/xyz"	lemon
"/var/prj/123_apple/coconut"	apple, coconut
"/var/prj/lemonade/coconutapple"
"/var/prj/apple/lemon"	apple, lemon

Keep in mind that the third row has no keyword that is delimited by /, \s, \d or _ on both sides, which is why there is no match. The regular expression should be something like [\s\d_/]keyword[\s\d_/]. I tried using:

df['Paths'].str.findall(r'[^\s\d_/]lemon|apple|coconut[\s\d_/$]', flags=re.IGNORECASE)

But it is still showing 'lemon' and 'coconut' in the third row.

Thank you in advance.

Try matching on word boundaries (`\b`) – Robert May 21 '21 at 19:32

score 1 · Answer 1 · answered May 21 '21 at 18:41


Your regex is not correct for what you're looking to match, which is easy to see with visualization tools like https://regexper.com/ (no affiliation; just grabbed the top Google result).

You have: [^\s\d_/]lemon|apple|coconut[\s\d_/$]


but likely want something like: [\s\d_/](lemon|apple|coconut)[\s\d_/]
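To see the difference without a visualizer, here is a minimal sketch using only the standard `re` module. The "pineapple" path is a made-up test string, not one from the question. It shows two things: in the original pattern the alternation splits the regex into three independent alternatives, so the bare `apple` in the middle has no delimiter check at all (and a `$` inside a character class is a literal dollar sign, not end-of-string); and in the grouped pattern the delimiters are consumed as part of each match, so a keyword at the very end of the string, or one sharing a delimiter with the previous match, is not found.

```python
import re

# The asker's pattern: "|" binds loosest, so this is really three separate
# alternatives, and only the first and last carry any delimiter check.
asker = re.compile(r'[^\s\d_/]lemon|apple|coconut[\s\d_/$]', re.IGNORECASE)

# The grouped pattern: a delimiter is required (and consumed) on both sides.
grouped = re.compile(r'[\s\d_/](lemon|apple|coconut)[\s\d_/]', re.IGNORECASE)

# Hypothetical path, used only to show the unguarded middle alternative.
pineapple_path = "/var/prj/pineapple/x"
print(asker.findall(pineapple_path))    # ['apple']
print(grouped.findall(pineapple_path))  # []

# Paths from the question: the second keyword is missed because nothing
# follows it and the "/" before it was already consumed by the first match.
print(grouped.findall("/var/prj/123_apple/coconut"))  # ['apple']
print(grouped.findall("/var/prj/apple/lemon"))        # ['apple']
```

Getting both keywords in those last two rows needs delimiter checks that do not consume characters, i.e. lookarounds.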

answered by Randy

Now it is only matching (finding) the first keyword only. From the second row, it is only matching 'coconut' and not 'apple'. Thanks. – Nayan Desale May 21 '21 at 18:49
Do you have that backwards? The regex I shared should find `apple` but not `coconut` since coconut doesn't end with `[\s\d_/]`. – Randy May 21 '21 at 18:58
Yes you are right. It is finding apple and not coconut. I am sorry for my wrong comment. But how can we get both of them? and with that specific regex? – Nayan Desale May 21 '21 at 19:04

score 1 · Accepted Answer · answered May 21 '21 at 22:01

You can use

df['Paths'].str.findall(r'(?<![^\W_])(?:lemon|apple|coconut)(?![^\W_])').str.join(", ")
df['Paths'].str.findall(r'(?<![^\W\d_])(?:lemon|apple|coconut)(?![^\W\d_])').str.join(", ")

See the regex demo (and regex demo #2). The regex matches:

(?<![^\W_]) - a location that is not immediately preceded by a letter or digit ([^\W_] matches any word char except the underscore), i.e. a left-hand word boundary with the _ subtracted from it
(?:lemon|apple|coconut) - a non-capturing group matching any of the words inside the group
(?![^\W_]) - a location that is not immediately followed by a letter or digit, i.e. a right-hand word boundary with the _ subtracted from it.

If you use (?<![^\W\d_]) and (?![^\W\d_]) your word boundaries will be letter boundaries, i.e. it will be \b with digits and underscore subtracted from it. See the Python demo:

import pandas as pd
df = pd.DataFrame({"Paths":["/var/prj/lemon_123/xyz", "/var/prj/123_apple/coconut", "/var/prj/lemonade/coconutapple", "/var/prj/apple/lemon"]})
df['Paths'].str.findall(r'(?<![^\W_])(?:lemon|apple|coconut)(?![^\W_])').str.join(", ")
#  0             lemon
#  1    apple, coconut
#  2                  
#  3      apple, lemon
#  Name: Paths, dtype: object
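The four sample paths give the same result with either variant, so the difference between the two only shows up when a digit sits directly next to a keyword. A small sketch with the plain `re` module, assuming a made-up path where that happens (the path and the variable names are hypothetical):

```python
import re

keywords = r'(?:lemon|apple|coconut)'

# Variant 1: a letter OR digit next to the keyword blocks the match.
alnum_bounded = re.compile(r'(?<![^\W_])' + keywords + r'(?![^\W_])')

# Variant 2: only a letter next to the keyword blocks the match,
# so digits act as delimiters just like "_" and "/".
letter_bounded = re.compile(r'(?<![^\W\d_])' + keywords + r'(?![^\W\d_])')

# Hypothetical path with digits touching the keywords.
path = "/var/prj/lemon123/2apple"

print(alnum_bounded.findall(path))   # []
print(letter_bounded.findall(path))  # ['lemon', 'apple']
```

Since the question lists \d among the allowed delimiters, the second variant is probably the closer fit whenever digits can touch a keyword directly.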

Hi, Thank you so much for your answer. This is now working correctly. But now I have tried it on 1000 different keywords like apple, coconut, lemon,.... etc. So if I put 1000 keywords inside it findall(). It shows "NaN" in the output even though some of them matches. How can I put 1000 different keywords inside it ? — Nayan Desale, May 24 '21 at 15:20
@NayanDesale Use the solution from [this answer](https://stackoverflow.com/a/42789508/3832970). Let me know if you need help implementing it. — Wiktor Stribiżew, May 24 '21 at 17:21
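For a long keyword list, one simple option (not necessarily what the linked answer does) is to build the alternation programmatically instead of typing it out. The sketch below is only an assumption about how that could look: `build_keyword_pattern` and `matched_keywords` are hypothetical helper names, and the three-item list stands in for the real ~1000 keywords. It also guards against non-string values, because pandas string methods return NaN for missing values in the column. Whether that is the cause of the NaN described above is a guess; the thread does not say.

```python
import re

def build_keyword_pattern(keywords):
    """Hypothetical helper: one compiled regex for an arbitrary keyword list."""
    # re.escape protects keywords that contain regex metacharacters.
    alternation = '|'.join(re.escape(k) for k in keywords)
    # Same lookaround boundaries as in the answer above.
    return re.compile(r'(?<![^\W_])(?:' + alternation + r')(?![^\W_])',
                      re.IGNORECASE)

keywords = ['lemon', 'apple', 'coconut']  # stand-in for the full list
pattern = build_keyword_pattern(keywords)

def matched_keywords(path):
    """Return the matched keywords joined by ', ', or '' for non-string input."""
    if not isinstance(path, str):  # e.g. NaN / None in the column
        return ""
    return ", ".join(pattern.findall(path))

print(matched_keywords("/var/prj/123_apple/coconut"))      # apple, coconut
print(matched_keywords("/var/prj/lemonade/coconutapple"))  # (empty string)
print(matched_keywords(float("nan")))                      # (empty string)
```

With a DataFrame this could be applied as `df['Paths'].map(matched_keywords)`. If a plain alternation of 1000 keywords turns out to be too slow, that is the point where a more elaborate approach would be worth trying.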
