45

I have a dataframe and I am trying to get the rows where one of the columns contains some string. The df looks like

member_id,event_path,event_time,event_duration
30595,"2016-03-30 12:27:33",yandex.ru/,1
30595,"2016-03-30 12:31:42",yandex.ru/,0
30595,"2016-03-30 12:31:43",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:44",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:45",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:46",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:49",kinogo.co/,1
30595,"2016-03-30 12:32:11",kinogo.co/melodramy/,0

And another df with urls

url
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_bq_phoenix
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_fly_
003\.ru\/sonyxperia
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony\/brands5D5Bbr_23
1click\.ru\/sonyxperia
1click\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/chasy-motorola

I use

urls = pd.read_csv('relevant_url1.csv', error_bad_lines=False)
substr = urls.url.values.tolist()
data = pd.read_csv('data_nts2.csv', error_bad_lines=False, chunksize=50000)
result = pd.DataFrame()
for i, df in enumerate(data):
    res = df[df['event_time'].str.contains('|'.join(substr), regex=True)]

but it returns

UserWarning: This pattern has match groups. To actually get the groups, use str.extract.

How can I fix that?

Petr Petrov

5 Answers

66

An alternative way to get rid of the warning is to change the regex so that it uses a non-capturing group instead of a capturing group. That is the (?:) notation.

Thus, if the capturing group is (url1|url2), it should be replaced by (?:url1|url2).
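As a rough illustration using only the standard re module (url1 and url2 are just the placeholders from above, and the sample strings are made up), the two forms match the same text but only the first one defines a group. As far as I can tell, pandas emits the warning when the compiled pattern reports at least one group, so this is a sketch of why the change helps rather than a statement about pandas internals:

```python
import re

capturing = re.compile(r'(url1|url2)')
non_capturing = re.compile(r'(?:url1|url2)')

# .groups is the number of capturing groups defined by the pattern
group_counts = (capturing.groups, non_capturing.groups)

# Both patterns still accept and reject exactly the same strings
samples = ['see url1 here', 'url2', 'nothing relevant']
same_matches = all(
    bool(capturing.search(s)) == bool(non_capturing.search(s)) for s in samples
)

print(group_counts, same_matches)
```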

climatebrad
45

At least one of the regex patterns in urls must use a capturing group. str.contains only returns True or False for each row in df['event_time'] -- it does not make use of the capturing group. Thus, the UserWarning is alerting you that the regex uses a capturing group but the match is not used.

If you wish to remove the UserWarning you could find and remove the capturing group from the regex pattern(s). They are not shown in the regex patterns you posted, but they must be there in your actual file. Look for parentheses outside of the character classes.
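One possible way to locate the offending patterns, instead of scanning the file by eye, is to compile each pattern and check how many capturing groups it reports. The sketch below uses two patterns from the question plus one hypothetical pattern (example\.ru\/...) that I made up to stand in for whatever is in the real file; patterns_with_groups is likewise just an illustrative helper name:

```python
import re

# Two patterns from the question plus one hypothetical pattern with a capturing group
patterns = [
    r'003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_fly_',
    r'1click\.ru\/sonyxperia',
    r'example\.ru\/(phones|tablets)',  # hypothetical offender, not from the question
]

def patterns_with_groups(pats):
    """Return the patterns whose compiled regex has at least one capturing group."""
    return [p for p in pats if re.compile(p).groups > 0]

offenders = patterns_with_groups(patterns)
print(offenders)
```

With the real data, the list would presumably come from urls.url.values.tolist() as in the question. Parentheses inside a character class such as [...()] are literals, so they are not counted as groups.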

Alternatively, you could suppress this particular UserWarning by putting

import warnings
warnings.filterwarnings("ignore", 'This pattern has match groups')

before the call to str.contains.
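If a global filter feels too broad, a scoped variant is possible with warnings.catch_warnings, which restores the previous filters when the block exits. This is only a sketch: I believe the exact wording of the message has differed between pandas versions, so the filter below assumes only that the text contains "match groups":

```python
import warnings

import pandas as pd

df = pd.DataFrame({'event_time': ['gouda', 'stilton', 'gruyere']})
pattern = 'g(.*)'  # contains a capturing group, so str.contains would warn

with warnings.catch_warnings():
    # Ignore only UserWarnings whose message mentions "match groups"
    warnings.filterwarnings('ignore', message='.*match groups', category=UserWarning)
    mask = df['event_time'].str.contains(pattern, regex=True)

# Outside the with-block the warning filters are back to what they were
matches = df[mask]
print(matches)
```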


Here is a simple example which demonstrates the problem (and solution):

# import warnings
# warnings.filterwarnings("ignore", 'This pattern has match groups') # uncomment to suppress the UserWarning

import pandas as pd

df = pd.DataFrame({ 'event_time': ['gouda', 'stilton', 'gruyere']})

urls = pd.DataFrame({'url': ['g(.*)']})   # With a capturing group, there is a UserWarning
# urls = pd.DataFrame({'url': ['g.*']})   # Without a capturing group, there is no UserWarning. Uncommenting this line avoids the UserWarning.

substr = urls.url.values.tolist()
df[df['event_time'].str.contains('|'.join(substr), regex=True)]

prints

  script.py:10: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
  df[df['event_time'].str.contains('|'.join(substr), regex=True)]

Removing the capturing group from the regex pattern:

urls = pd.DataFrame({'url': ['g.*']})   

avoids the UserWarning.

unutbu
13

You can use str.match instead. In your code (note that str.match has no regex parameter, so it is not passed here):

res = df[df['event_time'].str.match('|'.join(substr))]


Explanation

The warning is triggered by str.contains when the regular expression includes groups, e.g. in the regex r'foo(bar)', the (bar) part is considered a group because it is in parentheses. Therefore you could theoretically extract that part from a match.

However, the warning arguably doesn't make sense in the first place: contains is only supposed to "test if pattern or regex is contained within a string of a Series or Index" (pandas documentation). There is nothing about extracting groups.

In any case, str.match does not raise the warning, and it currently does almost the same as str.contains, except that (1) the pattern must match from the beginning of the string (it is based on re.match rather than re.search) and (2) regex matching cannot be turned off for str.match (str.contains has a regex parameter for that).
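The difference in anchoring matters for URL data like the question's, where the pattern may appear in the middle of the string. Here is a small sketch with made-up sample strings loosely based on the question's data; prefixing the pattern with '.*' is one way to make str.match behave like a "contains" check:

```python
import pandas as pd

# Made-up sample values, loosely based on the question's data
s = pd.Series(['kinogo.co/', 'yandex.ru/search/?text=kinogo', 'yandex.ru/'])
pattern = r'kinogo'

# str.contains finds the pattern anywhere in the string
contains_mask = s.str.contains(pattern, regex=True).tolist()

# str.match only succeeds if the pattern matches at the start of the string
match_mask = s.str.match(pattern).tolist()

# Allowing any leading characters makes str.match find it anywhere as well
prefixed_mask = s.str.match('.*' + pattern).tolist()

print(contains_mask, match_mask, prefixed_mask)
```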

toto_tico
  • `str.match('.*'+regex_string)` has the same expected behavior of `str.contains(regex_string)` with no warning. Only caveat... the `regex_string` shall be a string, not a compiled regular expression. – Marcello Aug 24 '21 at 12:25
  • This answer worked and the other answers did not. Thanks! – Michael Currie Feb 11 '23 at 14:19
9

You should use re.escape(yourString) for the string you are passing to contains.
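A short sketch of what re.escape does, using the 'g(.*)' pattern from the earlier example as input. Note the trade-off: escaping turns every special character into a literal, so this only fits when the strings are meant to be matched verbatim, which may not be the case for the regex patterns in the question:

```python
import re

raw = 'g(.*)'
escaped = re.escape(raw)

# After escaping, the parentheses are literals, so no capturing group remains
escaped_groups = re.compile(escaped).groups

# The escaped pattern matches the literal text "g(.*)" ...
matches_literal = re.search(escaped, 'text g(.*) text') is not None
# ... but no longer behaves like the regex "g followed by anything"
matches_gouda = re.search(escaped, 'gouda') is not None

print(escaped, escaped_groups, matches_literal, matches_gouda)
```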

Rob
6

Since regex=True is provided, the joined substr pattern is treated as a regex, which in your case contains capturing groups (substrings enclosed in parentheses).

You get the warning because str.contains has no use for captured groups: it only returns a boolean indicating whether the pattern is found in the string.

Obviously, you can suppress the warnings, but it's better to fix them.

Either escape the parentheses or use str.extract if you really want to capture something.
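For the second option, a minimal sketch of str.extract, reusing the sample values and the 'g(.*)' pattern from another answer here as assumed data. Rows where the pattern does not match come back as missing values:

```python
import pandas as pd

df = pd.DataFrame({'event_time': ['gouda', 'stilton', 'gruyere']})

# One capturing group gives a DataFrame with a single column labelled 0
extracted = df['event_time'].str.extract(r'g(.*)')
captured = extracted[0].tolist()

print(extracted)
```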

Chankey Pathak