0

There is something wrong in the code I am using for selecting all the rows which contains one of the strings in the following list:

search_query=['great game', 'gran game']
filtered_query=df[(df['Text'].str.lower().str.contains("|", search_query)) | (df['Low_Content'].str.contains("|", search_query))]

filtered_query.drop_duplicates(subset =["User", "Low_Content"], keep = False, inplace = True)

The code above should filter all the rows which contain at least one of the two strings in the list:

User            Text                             Low_Content
432         Great game!I liked it             We played yesterday
34          Good game, man.                    I like this sport
412         We played a GREAT GAME yesterday    Gran game!!!

The code should select only these rows

  User            Text                             Low_Content
    432         Great game!I liked it             We played yesterday  # it contains Great game in Text
    412         We played a GREAT GAME yesterday    Gran game!!!  # this contains both queries in both columns

I am not interested in finding either great or game: I would like to find both words (same for gran game).

The code above seems to select rows if they contains one of the two words, and not one of the two strings.

I would appreciated your help. Thanks

still_learning
  • 776
  • 9
  • 32
  • Please provide a [mcve]. – AMC Nov 08 '20 at 01:24
  • I provided it. I do not understand your comment – still_learning Nov 08 '20 at 02:04
  • It should be possible to copy/paste a few things and have your code running. As it stands, we're missing parts of the code, and the data is in a very inconvenient and ambiguous format. See [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391). – AMC Nov 09 '20 at 20:16

3 Answers3

1

You're using .str.contains incorrectly. Via your code .str.contains("|", ...) is calling .str.contains on the string "|". When passed to a regular expression (as .str.contains does), this is performs an or operation where it will match whatever is on the left or the right of the operator. In this case, we have empty strings on both sides of the "|" which is why you're matching all rows (because an empty string will always match a string with things in it)

Example:

>>> import re
>>> re.search("", "abc")
<re.Match object; span=(0, 0), match=''>

What you need to do is join your search_query into a string, where the elements are separated by the "|" (e.g. 'great game|gran game') to check if either exist within another string. Lastly, you'll need to pass case=False to .str.contains so that we can perform case-insensitive matching (e.g. "great game" will match "Great game").

search_elements =['great game', 'gran game']
search_query = "|".join(search_elements)

mask = df["Text"].str.contains(search_query, case=False) | df["Low_Content"].str.contains(search_query, case=False)
subset = df.loc[mask, :]

print(subset)
   User                              Text          Low_Content
0   432             Great game!I liked it  We played yesterday
2   412  We played a GREAT GAME yesterday         Gran game!!!
Cameron Riddell
  • 10,942
  • 9
  • 19
1

You are trying to search "" or "" in strings so if you replace:

df[(df['Text'].str.lower().str.contains("|", search_query)) | (df['Low_Content'].str.contains("|", search_query))]

with:

df[(df['Text'].str.lower().str.contains("great game|gran game")) | (df['Low_Content'].str.contains("great game|gran game"))]

the problem will be solved.

Mehdi Golzadeh
  • 2,594
  • 1
  • 16
  • 28
1

I changed the use of .str.contains() to .str.contains("|".join(search_query).
Now it searches for: 'great game|gran game' which is the correct regex you are looking for.

Working code example:

import pandas as pd
from io import StringIO

text = """
User\tText\tLow_Content
432\tGreat game!I liked it\tWe played yesterday
34\tGood game, man.\tI like this sport
412\tWe played a GREAT GAME yesterday\tGran game!!!
"""

df = pd.read_csv(StringIO(text), header=0, sep='\t')

search_query=['great game', 'gran game']

mask = (
    df['Text'].str.contains("|".join(search_query), case=False) 
    | df['Low_Content'].str.contains("|".join(search_query), case=False)
)

df[mask]

Documentation on .str.contains():
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html

Sander van den Oord
  • 10,986
  • 5
  • 51
  • 96