
I have a DataFrame like this:

text

Is it possible to apply [NUM] times
Is it possible to apply [NUM] time
Called [NUM] hour ago
waited [NUM] hours
waiting [NUM] minute
waiting [NUM] minutes???
Are you kidding me !
Waiting?

I want to detect rows that contain "[NUM] time", "[NUM] times", "[NUM] minute", "[NUM] minutes", "[NUM] hour", or "[NUM] hours". A row should also match if it contains "!" (one or more) or "??" (at least two question marks).

So the result would look like this:

text                                   available

Is it possible to apply [NUM] times    True
Is it possible to apply [NUM] time     True
Called [NUM] hour ago                  True
waited [NUM] hours                     True
waiting [NUM] minute                   True
waiting [NUM] minutes???               True
Are you kidding me !                   True
Waiting?                               False
I didn't like it                       False

So I want something like this, but I don't know how to put all these conditions together:

df["available"] = df['text'].apply(lambda x: re.match(r'[\!* | \?+ | [NUM] time | [NUM] hour | [NUM] minute]')
sariii

1 Answer


You can use Series.str.contains with a regex:

import pandas as pd
df = pd.DataFrame({'text': [
    "Is it possible to apply [NUM] times",
    "Is it possible to apply [NUM] time",
    "Called [NUM] hour ago",
    "waited [NUM] hours",
    "waiting [NUM] minute",
    "waiting [NUM] minutes???",
    "Are you kidding me !",
    "Waiting?",
    "I didn't like it",
]})
# True where the pattern is found anywhere in the string
df['available'] = df['text'].str.contains(r'\[NUM]\s*(?:hour|minute|time)s?\b|!|\?{2}', regex=True)
# => df['available']
#     0     True
#     1     True
#     2     True
#     3     True
#     4     True
#     5     True
#     6     True
#     7    False
#     8    False

See the regex demo. Details:

  • \[NUM] - the literal [NUM] string (the opening bracket is escaped)
  • \s* - zero or more whitespaces
  • (?:hour|minute|time) - a non-capturing group matching hour, minute or time
  • s? - an optional s
  • \b - a word boundary
  • | - or
  • ! - a ! char
  • | - or
  • \?{2} - two question marks.
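
As a quick check of the breakdown above, here is a minimal sketch that spells the same pattern out with re.VERBOSE and tries it on a few of the sample strings with plain re.search, outside pandas. The names PATTERN and is_available are made up for this illustration, and the "[NUM] timestamp" string is an extra test input that is not in the question.

```python
import re

# Same pattern as in the answer, written in verbose mode so each part
# can carry its own comment. PATTERN is a hypothetical name.
PATTERN = re.compile(r"""
    \[NUM]                    # the literal text [NUM]
    \s*                       # zero or more whitespace characters
    (?:hour|minute|time)s?    # hour, minute or time, with an optional s
    \b                        # word boundary
  |
    !                         # or: an exclamation mark
  |
    \?{2}                     # or: two consecutive question marks
""", re.VERBOSE)


def is_available(text):
    """Hypothetical helper: True if the pattern occurs anywhere in text."""
    return bool(PATTERN.search(text))


samples = [
    "waited [NUM] hours",
    "waiting [NUM] minutes???",
    "Are you kidding me !",
    "Waiting?",
    "I didn't like it",
    "[NUM] timestamp",  # extra input: \b rejects "time" inside a longer word
]
results = [is_available(s) for s in samples]
print(results)
```

Because of the word boundary, "[NUM] timestamp" is not flagged, while a single "?" is not enough to trigger the \?{2} branch.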
Wiktor Stribiżew
  • Do you happen to know why the same code does not work on a plain string rather than on the text inside a data frame? With `text="Been on hold for [NUM] minutes at that number, AFTER it wouldn't let me cancel the reservation."` and `available = re.match('r\[NUM]\s*(?:hour|minute|time|number|hr|Hr)s?\b|!{2}|\?{2}', text)`, `available` is `None`. Sorry for not opening a new question; I'm new to `regex`, I thought this might be an easy question, and I have received many negative points :(((( – sariii Nov 24 '21 at 17:50
  • @sariii Correct, you would get lots of downvotes on such a question. The answer is "use `re.search`". See [What is the difference between re.search and re.match?](https://stackoverflow.com/q/180986/3832970) – Wiktor Stribiżew Nov 24 '21 at 17:52
  • Yeah, I figured that if I posted that question my score would drop to -1000 :))). Thanks for sharing the link. However, neither `search` nor `match` returns any reasonable output; both return `None`. As I understand it, the difference between them only matters in cases where either a newline or `^` is present. My case is just one line, so I think both `match` and `search` should be able to do the job. Am I missing something here? – sariii Nov 24 '21 at 19:07
  • @sariii Your *regex* works, you just made a typo by moving the raw string literal `r` prefix into the string literal itself. See [this Python demo](https://ideone.com/KUTevm). – Wiktor Stribiżew Nov 24 '21 at 19:24
  • Ahhhh, that's true. Thanks so much :) – sariii Nov 24 '21 at 19:29
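
For anyone landing on this thread later, the two points from the comments (`re.search` versus `re.match`, and the misplaced `r` prefix) can be reproduced with a small sketch. The variable names `good`, `bad`, `match_good`, `search_good` and `search_bad` are made up for this illustration. The `bad` pattern is assembled by concatenation so that it contains the same characters the mistyped literal would produce, without triggering Python's invalid-escape warnings.

```python
import re

text = ("Been on hold for [NUM] minutes at that number, "
        "AFTER it wouldn't let me cancel the reservation.")

# Correct: the r prefix is outside the quotes, so this is a raw string literal.
good = r'\[NUM]\s*(?:hour|minute|time|number|hr|Hr)s?\b|!{2}|\?{2}'

# Typo from the comments: the r is inside the quotes. The pattern then
# starts with a literal "r", and in a non-raw literal \b is a backspace
# character rather than a word boundary.
bad = 'r' + r'\[NUM]\s*(?:hour|minute|time|number|hr|Hr)s?' + '\b' + r'|!{2}|\?{2}'

# re.match only tries to match at the very start of the string.
match_good = re.match(good, text)

# re.search scans the whole string for the first match.
search_good = re.search(good, text)

# Even with re.search, the mistyped pattern finds nothing in this text.
search_bad = re.search(bad, text)

print(match_good)
print(search_good.group())
print(search_bad)
```

So in this case both fixes are needed: `re.search` instead of `re.match`, and the `r` in front of the opening quote.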