
I have a million-entry dataset containing observations typed by humans to indicate certain 'operational' outcomes. To create some categories, I need to look at this column and extract certain EXACT expressions that are most commonly used. They can appear at the start, end, or middle of the string, and may or may not be abbreviated.

I have constructed the following example:

import pandas as pd

data = {'file': ['1','2','3','4','5','6'],
        'observations': ['text one address', 'text 2 some', 
                         'text home 3', 'notified text 4',
                         'text 5 add','text 6 homer']}

df = pd.DataFrame(data=data)

I am trying to use pandas to isolate and extract, say, 'home', 'not' and 'address'. I have tried the following approach (the '|'.join idea is taken from another answer on this site):

conditions = ['home','not','address']
test = df[df['observations'].str.contains('|'.join(conditions))]

str.contains won't work because it picks up 6: 'text 6 homer', as it contains 'home' (the real case is even worse, because with abbreviations there are terms like 'ho', for example).
str.match won't work because it will pick up 'notified'.
str.fullmatch won't work because it only matches entire strings, and these are long sentences.
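
The three behaviours described above can be reproduced on the sample data. This is only a minimal sketch: the variable names `pattern`, `obs`, `contains_hits`, `match_hits` and `fullmatch_hits` are introduced here for illustration, and `str.fullmatch` assumes pandas 1.1 or later.

```python
import pandas as pd

df = pd.DataFrame({
    'file': ['1', '2', '3', '4', '5', '6'],
    'observations': ['text one address', 'text 2 some',
                     'text home 3', 'notified text 4',
                     'text 5 add', 'text 6 homer'],
})
conditions = ['home', 'not', 'address']
pattern = '|'.join(conditions)  # 'home|not|address', no word boundaries

obs = df['observations']

# contains: matches anywhere, so 'homer' and 'notified' are picked up too
contains_hits = df.loc[obs.str.contains(pattern), 'file'].tolist()

# match: anchored at the start of the string, so only 'notified ...' matches
match_hits = df.loc[obs.str.match(pattern), 'file'].tolist()

# fullmatch: the whole string must equal one of the terms, so nothing matches
fullmatch_hits = df.loc[obs.str.fullmatch(pattern), 'file'].tolist()

print(contains_hits, match_hits, fullmatch_hits)
```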

Help appreciated...

jcf
  • What do you mean by *extract say 'home', 'not' and 'address'* here? Do you just need these substrings from the dataframe column? – ThePyGuy Jul 29 '21 at 19:41
  • Need to create a new dataframe subset with rows that contain those exact strings in the middle of the long sentences. – jcf Jul 29 '21 at 19:51

1 Answer


Is this what you expect?

>>> df[df['observations'].str.contains(fr"\b(?:{'|'.join(conditions)})\b")]

  file      observations
0    1  text one address
2    3       text home 3

\b asserts a position at a word boundary: (^\w|\w$|\W\w|\w\W)

(?:...) is a non-capturing group
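
For completeness, here is a self-contained sketch of the same filter. One addition that is not part of the answer above: each term is passed through `re.escape`, on the assumption that the search terms are literal text and may contain regex metacharacters such as `.` or `(`. The names `pattern`, `mask` and `result` are illustrative.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'file': ['1', '2', '3', '4', '5', '6'],
    'observations': ['text one address', 'text 2 some',
                     'text home 3', 'notified text 4',
                     'text 5 add', 'text 6 homer'],
})
conditions = ['home', 'not', 'address']

# Escape each term (assumption: terms are literals), then wrap the
# alternation in word boundaries so only whole words match.
pattern = r"\b(?:" + "|".join(re.escape(c) for c in conditions) + r")\b"

mask = df['observations'].str.contains(pattern, regex=True)
result = df[mask]  # 'homer' and 'notified' are excluded
print(result)
```

Note that \b relies on word characters, so a term that begins or ends with punctuation may need a different boundary check.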

Corralien
  • Yeah, this works. And if you need to extract the matched substrings, `str.extract` can be used by passing the regex (with a capturing group). – ThePyGuy Jul 29 '21 at 19:47
  • Worked like a charm, thanks a lot. I'll go study regular expressions now. – jcf Jul 29 '21 at 19:50
  • @jcf, You can look at [Learning Regular Expressions](https://stackoverflow.com/questions/4736/learning-regular-expressions) to get a good idea of regex and the meta characters. – ThePyGuy Jul 29 '21 at 19:53
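
The `str.extract` suggestion from the comments might look like the sketch below. It assumes the non-capturing group is swapped for a capturing one, since `str.extract` needs at least one capture group; the `matched` column and the `subset` name are hypothetical, and `re.escape` is again an added precaution for literal terms.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'file': ['1', '2', '3', '4', '5', '6'],
    'observations': ['text one address', 'text 2 some',
                     'text home 3', 'notified text 4',
                     'text 5 add', 'text 6 homer'],
})
conditions = ['home', 'not', 'address']

# A capturing group (...) replaces (?:...) so the matched term is returned.
pattern = r"\b(" + "|".join(map(re.escape, conditions)) + r")\b"

# expand=False yields a Series; rows without a match get NaN.
df['matched'] = df['observations'].str.extract(pattern, expand=False)

# Keep only the rows where one of the terms was found as a whole word.
subset = df.dropna(subset=['matched'])
print(subset)
```

Only the first match per row is returned this way; `str.findall` with the same pattern would list every term found in a row.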