I am trying to extract a set of keywords such as ['lemon', 'apple', 'coconut'] from paths such as "/var/prj/lemon_123/xyz", "/var/prj/123_apple/coconut", "/var/prj/lemonade/coconutapple", "/var/prj/apple/lemon".

The expected output is a little complex:

Paths                             MatchedKeywords
"/var/prj/lemon_123/xyz"          lemon
"/var/prj/123_apple/coconut"      apple, coconut
"/var/prj/lemonade/coconutapple"  (no match)
"/var/prj/apple/lemon"            apple, lemon

Keep in mind that the third row has no match because neither keyword appears there as an exact word delimited by /, \s, \d or _. The delimiter I have in mind on each side of a keyword is roughly [\s\d_/]. I tried using:

df['Paths'].str.findall(r'[^\s\d_/]lemon|apple|coconut[\s\d_/$]', flags=re.IGNORECASE)

But it is still showing 'lemon' and 'coconut' in the third row.

Thank you in advance.

2 Answers

Your regex is not correct for what you're looking to match, which is easy to see with visualization tools like https://regexper.com/ (no affiliation; just grabbed the top Google result).

You have: [^\s\d_/]lemon|apple|coconut[\s\d_/$]

but likely want something like: [\s\d_/](lemon|apple|coconut)[\s\d_/]
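
To make the difference concrete, here is a minimal sketch using plain `re` (the short test strings such as "xlemon" and "pineapple" are made up for illustration). It shows that `|` has the lowest precedence, so the character classes in the original pattern attach only to `lemon` and `coconut`, and that `$` inside a character class is a literal dollar sign. It also shows a limitation of the grouped pattern: it consumes the delimiter on each side, so a keyword at the very end of the string is not found.

```python
import re

paths = [
    "/var/prj/lemon_123/xyz",
    "/var/prj/123_apple/coconut",
    "/var/prj/lemonade/coconutapple",
    "/var/prj/apple/lemon",
]

# Original pattern: alternation binds last, so this reads as
#   [^\s\d_/]lemon   OR   apple   OR   coconut[\s\d_/$]
original = re.compile(r'[^\s\d_/]lemon|apple|coconut[\s\d_/$]', re.IGNORECASE)

print(original.findall("xlemon"))     # ['xlemon']   (needs a non-delimiter before lemon)
print(original.findall("pineapple"))  # ['apple']    (apple has no boundary check at all)
print(original.findall("coconut$"))   # ['coconut$'] ($ is a literal inside [...])

# Grouped pattern: the delimiter classes now apply to every keyword,
# but each match needs (and consumes) a real character on both sides.
grouped = re.compile(r'[\s\d_/](lemon|apple|coconut)[\s\d_/]', re.IGNORECASE)

grouped_hits = [grouped.findall(p) for p in paths]
print(grouped_hits)  # [['lemon'], ['apple'], [], ['apple']]
```

The missing `coconut` and `lemon` at the end of rows 2 and 4 are the reason a pattern based on lookarounds, which check the neighbouring characters without consuming them, is a better fit here.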

Randy
  • Now it is only matching (finding) the first keyword only. From the second row, it is only matching 'coconut' and not 'apple'. Thanks. – Nayan Desale May 21 '21 at 18:49
  • Do you have that backwards? The regex I shared should find `apple` but not `coconut` since coconut doesn't end with `[\s\d_/]`. – Randy May 21 '21 at 18:58
  • Yes you are right. It is finding apple and not coconut. I am sorry for my wrong comment. But how can we get both of them? and with that specific regex? – Nayan Desale May 21 '21 at 19:04

You can use either of the following:

df['Paths'].str.findall(r'(?<![^\W_])(?:lemon|apple|coconut)(?![^\W_])').str.join(", ")
df['Paths'].str.findall(r'(?<![^\W\d_])(?:lemon|apple|coconut)(?![^\W\d_])').str.join(", ")

See the regex demo (and regex demo #2). The regex matches:

  • (?<![^\W_]) - a location that is not immediately preceded by a letter or digit, i.e. by a char other than a non-word char or an underscore (it is a left-hand word boundary with the _ subtracted from it)
  • (?:lemon|apple|coconut) - a non-capturing group matching any of the words inside the group
  • (?![^\W_]) - a location that is not immediately followed by a letter or digit, i.e. by a char other than a non-word char or an underscore (it is a right-hand word boundary with the _ subtracted from it).

If you use (?<![^\W\d_]) and (?![^\W\d_]) your word boundaries will be letter boundaries, i.e. it will be \b with digits and underscore subtracted from it. See the Python demo:

import pandas as pd
df = pd.DataFrame({"Paths":["/var/prj/lemon_123/xyz", "/var/prj/123_apple/coconut", "/var/prj/lemonade/coconutapple", "/var/prj/apple/lemon"]})
df['Paths'].str.findall(r'(?<![^\W_])(?:lemon|apple|coconut)(?![^\W_])').str.join(", ")
#  0             lemon
#  1    apple, coconut
#  2                  
#  3      apple, lemon
#  Name: Paths, dtype: object
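
The two variants give the same result on the sample data above, so here is a small sketch of where they differ, using plain `re` rather than pandas. The sample path is made up for illustration and is not from the question.

```python
import re

keywords = r'(?:lemon|apple|coconut)'

# Variant 1: the neighbouring chars may not be letters or digits
word_bounded = re.compile(r'(?<![^\W_])' + keywords + r'(?![^\W_])')

# Variant 2: the neighbouring chars may not be letters (digits are allowed)
letter_bounded = re.compile(r'(?<![^\W\d_])' + keywords + r'(?![^\W\d_])')

sample = "/var/prj/lemon123/7apple_x/coconuts"  # hypothetical path

print(word_bounded.findall(sample))    # []
print(letter_bounded.findall(sample))  # ['lemon', 'apple']
```

Since the question lists \d among the delimiters, the second variant is likely the closer match when a keyword sits directly next to a digit, as in `lemon123`.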
Wiktor Stribiżew
  • Hi, Thank you so much for your answer. This is now working correctly. But now I have tried it on 1000 different keywords like apple, coconut, lemon,.... etc. So if I put 1000 keywords inside it findall(). It shows "NaN" in the output even though some of them matches. How can I put 1000 different keywords inside it ? – Nayan Desale May 24 '21 at 15:20
  • @NayanDesale Use the solution from [this answer](https://stackoverflow.com/a/42789508/3832970). Let me know if you need help implementing it. – Wiktor Stribiżew May 24 '21 at 17:21
  • Thank you so much Wiktor. This is really helpful !! – Nayan Desale Jun 01 '21 at 22:07
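
For a long keyword list, one simple option is to build the alternation from a Python list instead of typing it by hand. The sketch below is not the approach from the linked answer. It is a plain-`re` illustration under a few assumptions: the helper name `build_keyword_pattern` is hypothetical, keywords are escaped with `re.escape` in case they contain regex metacharacters, and longer keywords are placed first so a shorter alternative is never tried before a longer one that shares its prefix.

```python
import re

def build_keyword_pattern(keywords, flags=0):
    """Compile one regex that matches any keyword between the custom boundaries.

    Hypothetical helper for illustration only.
    """
    # Escape metacharacters, drop duplicates, and sort longest first
    escaped = sorted({re.escape(k) for k in keywords}, key=lambda s: (-len(s), s))
    body = "|".join(escaped)
    # Same boundaries as above: no letter or digit directly before or after
    return re.compile(r'(?<![^\W_])(?:' + body + r')(?![^\W_])', flags)

keywords = ["lemon", "apple", "coconut", "c++"]  # stand-in for the real list
pattern = build_keyword_pattern(keywords)

print(pattern.findall("/var/prj/123_apple/coconut"))      # ['apple', 'coconut']
print(pattern.findall("/var/prj/lemonade/coconutapple"))  # []
print(pattern.findall("/src/c++/apple"))                  # ['c++', 'apple']
```

With pandas, the generated pattern string can then be passed on as `df['Paths'].str.findall(pattern.pattern).str.join(", ")`.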