Python - Findall matching string(s) between two DataFrame columns - sequence item 0: expected str instance, tuple found

Question

I am trying to compare the strings btw two DataFrame columns. category_df['column_text_to_find'] contains string to match in other dataframe df2['column_text_to_search']. The new column df2['matched text'] should return the df['column_text_to_find'] found in df2['column_text_to_search']. my expected result is

      ['column_text_to_search']   ['column_text_to_find']     ['matched text']
   'SP * GRAPHICSDIRECT.ascdadv'     'GRAPHICSDIRECT'          'GRAPHICSDIRECT'
   '99 CENTS ONLY #777#'             '99 CENTS ONLY'           '99 CENTS ONLY'
   'PAYPAL *BESTBUY COM  #3422#'     'BESTBUY'                 'BESTBUY'

Unfortunately, my code returns an error.

csv import:

for f in all_files:
    df = pd.read_csv(f, sep=',',header=[3])
df2 = df

remove blank spaces:

df2['column_text_to_search']=df2['column_text_to_search'].str.strip()

search and match text:

 ch = category_df['column_text_to_find']
 pat = r'\b({0})\b'.format('|'.join(ch))
 df2['matched text'] = df2['column_text_to_search'].str.findall(pat, flags = 
 re.IGNORECASE).map("_".join)
 df2.head()

Error:

TypeError: sequence item 0: expected str instance, tuple found

Can you give a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example), please? Also, show the full stack trace. — nonDucor, Feb 17 '22 at 17:42
@nonDucor I added an example of column_text_to_search and column_text_to_find. They are stored in different dataframe. Does it help? — AZen, Feb 17 '22 at 18:27
You should put it in a way that make it easy to create the dataframe. — nonDucor, Feb 17 '22 at 18:29
@nonDucor The two data frames have a different number of rows. category_df.column_text_to_find has about 200 lines while column_text_to_search has more than 4000. — AZen, Feb 17 '22 at 23:54
Do you need a whole word search? This - `df['matched text'] = df.apply(lambda r: r["column_text_to_find"] if r["column_text_to_find"] in r["column_text_to_search"] else "", axis=1)` - seems to work for you. — Wiktor Stribiżew, Feb 24 '22 at 09:42
Or, if you need a whole word search, try `df["column_text_to_search"].str.findall(r'(?i)\b({0})\b'.format("|".join(df["column_text_to_find"].to_list()))).str.join('_')` — Wiktor Stribiżew, Feb 24 '22 at 09:49

Wiktor Stribiżew · Accepted Answer · 2022-02-25T08:00:01.637

You can use

pattern = r'(?i)\b({0})\b'.format("|".join(df["column_text_to_find"].to_list()))
df["column_text_to_search"].str.findall(pattern).str.join('_')

Or, if your "words" to find can contain special chars anywhere in the string:

pattern = r'(?i)(?!\B\w)({0})(?<!\w\B)'.format("|".join( sorted(map(re.escape, df["column_text_to_find"].to_list()), key=len, reverse=True) ))
df["column_text_to_search"].str.findall(pattern).str.join('_')

Note the use of

(?i) - it enables case insensitive search
\b...\b - word boundaries enable whole word search for natural language words (if the "wors" can contain special chars in arbitrary positions, you cannot rely on word boundaries)
(?!\B\w) / (?<!\w\B) - dynamic adaptive word boundaries that only require a word boundary if the neighbouring char in the word to find is a word char
"|".join(df["column_text_to_find"].to_list()) - forms an alternation based pattern of values inside the column_text_to_find column.
sorted(map(re.escape, df["column_text_to_find"].to_list()), key=len, reverse=True) - sorts the words to find by length in descending order and escapes them for use in regex
.findall(pattern) - finds all occurrences of the pattern and
.str.join('_') - joins them with _.

Thanks! It works for me—only one issue: error: nothing to repeat at position XXXX. I have learned that is something that should be solved using `re. compile ` or `re. escape` . I do not understand how to use it. Do you have any suggestions? — AZen, Feb 25 '22 at 06:07
@AZen See *`\b...\b` - word boundaries enable whole word search for natural language words (if the "wors" can contain special chars in arbitrary positions, you cannot rely on word boundaries)*. The answer is: use **adaptive dynamic word boundaries**. — Wiktor Stribiżew, Feb 25 '22 at 07:57

score 0 · Answer 2 · answered Feb 17 '22 at 18:39

0

here is the code that solves your problem if i understood your problem statement

df2[match_text]=''
for j in range(len(df2)):
    for i in range(len(category_df)):
        if df2.column_text_to_search[j] in category_df.column_text_to_find[i]:
            df2.match_text[j]=df2.column_text_to_search[j]
            break

answered Feb 17 '22 at 18:39

N R

48
3

The two data frames have a different number of rows. category_df.column_text_to_find has about 200 lines while column_text_to_search has more than 4000. – AZen Feb 17 '22 at 23:38
it doesn't matter if count of rows are not same between two dataframes right. If i understand, there is text to find in other column, if that text found in any row in other dataframe, you want to know that. am i right ? – N R Feb 27 '22 at 19:06

Python - Findall matching string(s) between two DataFrame columns - sequence item 0: expected str instance, tuple found

2 Answers2

Linked

Related