1

I am trying to compare the strings btw two DataFrame columns. category_df['column_text_to_find'] contains string to match in other dataframe df2['column_text_to_search']. The new column df2['matched text'] should return the df['column_text_to_find'] found in df2['column_text_to_search']. my expected result is

      ['column_text_to_search']   ['column_text_to_find']     ['matched text']
   'SP * GRAPHICSDIRECT.ascdadv'     'GRAPHICSDIRECT'          'GRAPHICSDIRECT'
   '99 CENTS ONLY #777#'             '99 CENTS ONLY'           '99 CENTS ONLY'
   'PAYPAL *BESTBUY COM  #3422#'     'BESTBUY'                 'BESTBUY'

  

Unfortunately, my code returns an error.

csv import:

for f in all_files:
    df = pd.read_csv(f, sep=',',header=[3])
df2 = df

remove blank spaces:

df2['column_text_to_search']=df2['column_text_to_search'].str.strip()

search and match text:

 ch = category_df['column_text_to_find']
 pat = r'\b({0})\b'.format('|'.join(ch))
 df2['matched text'] = df2['column_text_to_search'].str.findall(pat, flags = 
 re.IGNORECASE).map("_".join)
 df2.head()

Error:

TypeError: sequence item 0: expected str instance, tuple found
AZen
  • 15
  • 5
  • Can you give a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example), please? Also, show the full stack trace. – nonDucor Feb 17 '22 at 17:42
  • @nonDucor I added an example of column_text_to_search and column_text_to_find. They are stored in different dataframe. Does it help? – AZen Feb 17 '22 at 18:27
  • You should put it in a way that make it easy to create the dataframe. – nonDucor Feb 17 '22 at 18:29
  • @nonDucor The two data frames have a different number of rows. category_df.column_text_to_find has about 200 lines while column_text_to_search has more than 4000. – AZen Feb 17 '22 at 23:54
  • Do you need a whole word search? This - `df['matched text'] = df.apply(lambda r: r["column_text_to_find"] if r["column_text_to_find"] in r["column_text_to_search"] else "", axis=1)` - seems to work for you. – Wiktor Stribiżew Feb 24 '22 at 09:42
  • Or, if you need a whole word search, try `df["column_text_to_search"].str.findall(r'(?i)\b({0})\b'.format("|".join(df["column_text_to_find"].to_list()))).str.join('_')` – Wiktor Stribiżew Feb 24 '22 at 09:49

2 Answers2

2

You can use

pattern = r'(?i)\b({0})\b'.format("|".join(df["column_text_to_find"].to_list()))
df["column_text_to_search"].str.findall(pattern).str.join('_')

Or, if your "words" to find can contain special chars anywhere in the string:

pattern = r'(?i)(?!\B\w)({0})(?<!\w\B)'.format("|".join( sorted(map(re.escape, df["column_text_to_find"].to_list()), key=len, reverse=True) ))
df["column_text_to_search"].str.findall(pattern).str.join('_')

Note the use of

  • (?i) - it enables case insensitive search
  • \b...\b - word boundaries enable whole word search for natural language words (if the "wors" can contain special chars in arbitrary positions, you cannot rely on word boundaries)
  • (?!\B\w) / (?<!\w\B) - dynamic adaptive word boundaries that only require a word boundary if the neighbouring char in the word to find is a word char
  • "|".join(df["column_text_to_find"].to_list()) - forms an alternation based pattern of values inside the column_text_to_find column.
  • sorted(map(re.escape, df["column_text_to_find"].to_list()), key=len, reverse=True) - sorts the words to find by length in descending order and escapes them for use in regex
  • .findall(pattern) - finds all occurrences of the pattern and
  • .str.join('_') - joins them with _.
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Thanks! It works for me—only one issue: error: nothing to repeat at position XXXX. I have learned that is something that should be solved using `re. compile ` or `re. escape` . I do not understand how to use it. Do you have any suggestions? – AZen Feb 25 '22 at 06:07
  • @AZen See *`\b...\b` - word boundaries enable whole word search for natural language words (if the "wors" can contain special chars in arbitrary positions, you cannot rely on word boundaries)*. The answer is: use **adaptive dynamic word boundaries**. – Wiktor Stribiżew Feb 25 '22 at 07:57
0

here is the code that solves your problem if i understood your problem statement

df2[match_text]=''
for j in range(len(df2)):
    for i in range(len(category_df)):
        if df2.column_text_to_search[j] in category_df.column_text_to_find[i]:
            df2.match_text[j]=df2.column_text_to_search[j]
            break
N R
  • 48
  • 3
  • The two data frames have a different number of rows. category_df.column_text_to_find has about 200 lines while column_text_to_search has more than 4000. – AZen Feb 17 '22 at 23:38
  • it doesn't matter if count of rows are not same between two dataframes right. If i understand, there is text to find in other column, if that text found in any row in other dataframe, you want to know that. am i right ? – N R Feb 27 '22 at 19:06