I would like to classify the elements in the rows of a pandas DataFrame based on a list of strings:

    list_topics = ['orange', 'sports', 'technology', 'apple pie', 'fruits']

I want to check whether each website contains one of these strings, in order to classify it.

For example:

 Website
www.apple.com
www.orange_is_the_new_black.co.uk
...
www.mitapple.com

These elements are stored in the first field of each row (row[0]). I have tried the following:

    writer = csv.writer(f_output, lineterminator='\n')
    reader = csv.reader(f_input)

    header = next(reader)
    header.append('Classification')
    writer.writerow(header)

    for row in reader:
        check_el = ['not classified']
        for x in list_topics:
            if row[0].str.contains[x]:
                check_el[0] = x
        writer.writerow(row + match)

However, it returns only 'not classified' instead of the expected output:

  Website                                Topics
www.apple.com                            apple
www.orange_is_the_new_black.co.uk        orange
...                                      ...
www.mitapple.com                         apple

Could you please tell me how to compare each row against the strings in the list and check whether the row contains one of them?

Thanks

yatu

1 Answer

You can first flatten your list of topics by splitting multi-word entries such as 'apple pie' into single words with a list comprehension, and then use str.extract to pull out the first of those words found in each website (rows with no match get NaN):

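    # Split multi-word topics ('apple pie' -> 'apple', 'pie') into single keywords
    l = [j for i in list_topics for j in i.split()]
    # Extract the first keyword found in each website; expand=False returns a Series
    df['Topics'] = df.Website.str.extract(rf"({'|'.join(l)})", expand=False)

    print(df)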

                             Website  Topics
0                      www.apple.com   apple
1  www.orange_is_the_new_black.co.uk  orange
2                   www.mitapple.com   apple
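
If you would rather keep the csv loop from the question, note that row[0] there is a plain Python string, so it has no .str accessor. The substring test is simply `x in row[0]`, and the row written out should carry the classification found rather than the undefined name `match`. Below is a minimal sketch of that loop. It assumes in-memory io.StringIO objects as stand-ins for f_input and f_output, and the helper name classify and the extra row www.example.com are hypothetical, added only to show the fallback:

```python
import csv
import io

list_topics = ['orange', 'sports', 'technology', 'apple pie', 'fruits']
# Split multi-word topics into single keywords, as in the list comprehension above.
keywords = [word for topic in list_topics for word in topic.split()]


def classify(url, keywords, default='not classified'):
    """Return the first keyword (in list order) that occurs in url, else default."""
    for keyword in keywords:
        if keyword in url:  # plain substring test on a Python str
            return keyword
    return default


# Hypothetical stand-ins for the question's f_input and f_output file objects.
f_input = io.StringIO(
    "Website\n"
    "www.apple.com\n"
    "www.orange_is_the_new_black.co.uk\n"
    "www.mitapple.com\n"
    "www.example.com\n"
)
f_output = io.StringIO()

reader = csv.reader(f_input)
writer = csv.writer(f_output, lineterminator='\n')

header = next(reader)
writer.writerow(header + ['Classification'])

for row in reader:
    # Append the classification as one extra column.
    writer.writerow(row + [classify(row[0], keywords)])

print(f_output.getvalue())
```

One difference to keep in mind: str.extract returns whichever keyword occurs leftmost in the website string, whereas this loop returns the first keyword in list order that matches.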

Since the strings to match may differ slightly from the entries in the list, I'd also suggest looking into fuzzy searching. For this case, you could use fuzzy_merge from this post to get a similar result:

    df_topics = pd.DataFrame(list_topics, columns=['topics'])

    fuzzy_merge(df, df_topics, left_on='Website', right_on='topics', how='left', cutoff=0.25)

                             Website  Topics     topics
0                      www.apple.com   apple  apple pie
1  www.orange_is_the_new_black.co.uk  orange     orange
2                   www.mitapple.com   apple        NaN
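
fuzzy_merge is not defined in this answer, so the following is only a rough illustration of the general idea of fuzzy matching, not a reimplementation of that function. It uses difflib.get_close_matches from the standard library, the helper name best_topic is hypothetical, and the scores (and therefore the matches) may well differ from the fuzzy_merge output shown above:

```python
import difflib

list_topics = ['orange', 'sports', 'technology', 'apple pie', 'fruits']


def best_topic(website, topics, cutoff=0.25):
    """Return the topic most similar to website, or None if none scores >= cutoff."""
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio.
    matches = difflib.get_close_matches(website, topics, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for site in ['www.apple.com', 'www.orange_is_the_new_black.co.uk', 'www.mitapple.com']:
    print(site, '->', best_topic(site, list_topics))
```

A low cutoff such as 0.25 is needed here because a full URL is much longer than a topic, which drags the similarity ratio down; raising it makes the matching stricter and leaves more rows unmatched.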
yatu