0

I would like to search within more columns some words selected by input. It could be just one word or more than one (so a list of words). This is how my dataset looks like:

Text1                                 Text2
Include details about your goal...    Include any error messages...
Describe expected and actual results  Help you with specific coding, algorithm, or language problems: 4.5 points

Let us say that, for example, I would like to select only elements that contain goal or error, words that I added manually by using input:

I should have:

Text1                                 Text2
Include details about your goal...    Include any error messages...

I could also create a new column that add a boolean value in case there is a match with my search/filter request for one/both columns/terms (True) or none (False).

I have tried as follows:

def matcher(x):
    for i in list_to_search:
        if i.lower() in x.lower():
            return i
    else:
        return np.nan

list_to_search = [] 
while True:
    query = input("Enter your query: ")
    list_to_search.append(query)
    print(list_to_search)

print(list_to_search)

df['Match'] = df['Text1'].apply(matcher)

However the code above run forever because of while True condition. From the example above, I would like only to search for goals or errors, but I would have also searched for goals and errors. And I would also be interested in looking for possible consecutive words (for example 4.5 points). Finally, I do not know how to check for multiple columns (in the code above I am looking only for matching in column Text1, not in Text1 and/or Text2).

I hope you can help me. Thanks

UPDATE: I think I have fixed the issue with the while loop:

list_to_search = [] 

def matcher(x):
    for i in list_to_search:
        if i.lower() in x.lower():
            return i
    else:
        return np.nan

while True:
    query = input("Enter your query: ")
    if query!='-f':
        list_to_search.append(query)
        print(list_to_search)

    if query=='-f':
        break
print(list_to_search)

df['Match'] = df['Text1'].apply(matcher)
still_learning
  • 776
  • 9
  • 32

1 Answers1

1

you use Series.str.lower() and Series.str.contains() (regex=True by default)

m = df['Text1'].str.lower().str.contains('|'.join(mylist))

then you can filter Text1 column:

df.loc[m,'Text1']

I think we have to try avoid use apply here, you can see it about when use apply in your code

if you want look for words in a list of columns you can use:

my_columns = ['Text1', 'Text2']

m = (df[my_columns].stack(dropna=False).str.lower()
                   .str.contains('|'.join(mylist))
                   .unstack())

df.where(m)

We can also iter through columns:

for name_colum, column_serie in df[my_columns].items():
    m = (column_serie.str.lower()
                     .str.contains('|'.join(mylist)))
   print(column_serie[m])

Here is an example:

l = ['a','b']
mask = '|'.join(l)
s = pd.Series(['a','b','c'])
s.str.lower().str.contains(mask)

Finally you can also see:

Series.str.match instead Series.str.contains

difference between pandas series str match and pandas series str contains

ansev
  • 30,322
  • 5
  • 17
  • 31
  • Using the first line of code for `m` I am getting the following error: `TypeError: 'builtin_function_or_method' object is not subscriptable` . I do not know if the issue comes from the input function – still_learning Apr 25 '20 at 18:45
  • sorry I forget: *`()`* in `'|'.join([list of words])`, example: `'|'.join([l'hello', 'home', 'make'])` – ansev Apr 25 '20 at 18:49
  • this one: `m = df['Text1'].str.lower().str.contains('|'.join([my_list]))`. I used a fake list: `my_list=['ball','bag']` as I do not know if the way to add items to search is correct or not – still_learning Apr 25 '20 at 18:51
  • 1
    Sorry, I understand that my explanation was not clear. THe correct way is `m = df['Text1'].str.lower().str.contains('|'.join(['ball', 'bag']))` – ansev Apr 25 '20 at 18:53
  • Changing in: `m = df['Titles'].str.lower().str.contains('|'.join(my_list))` I have not got any error. It should work. May I ask you f it is ok to use my input list? – still_learning Apr 25 '20 at 18:53
  • I think you can perfectly use an input list that is loaded with a while loop just like you do in your code. This code works even when the list only has a single word. – ansev Apr 25 '20 at 19:04
  • Also I would use : `while query != '-f':` , so you don't need use any clause `if` inside `while loop ` and you don't need `break` – ansev Apr 25 '20 at 19:07
  • By using `m = (df[my_columns].stack(dropna=False).str.lower() .str.contains('|'.join(mylist)) .unstack())` I got the following error: `AttributeError: 'DataFrame' object has no attribute 'str'`. Do you know why? – still_learning Apr 25 '20 at 19:25
  • Are you using MultiIndex in columns? DataFrame.stack is used here in order to convert the dataframe in a stacked series and then we can use series.str.contains... so I think you could have a Multicolumns Dataframe, check isinstance(df.stack(), pd.Series) – ansev Apr 25 '20 at 22:37