
I am interested in finding the same words in two lists. I have two lists of words in text_list. I also stemmed the words.

text_list = [['i', 'am', 'interest' ,'for', 'this', 'subject'], ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']

So I need this output:

same_words= ['a', 'sentence', 'interest']
codester_09
Rina
    Does this answer your question? [Common elements comparison between 2 lists](https://stackoverflow.com/questions/2864842/common-elements-comparison-between-2-lists) – sahasrara62 Jul 01 '22 at 06:55
  • Is `"interest"` and `"interesting"` supposed to be considered the same? – Aniketh Malyala Jul 01 '22 at 07:02
  • Yes, it is the same word, but a different grammatical form, that's why I am searching for an approach in Python, that can return 'interest' and 'interesting' like a same word. – Rina Jul 01 '22 at 07:12
  • "I also stemmed the words." What exactly do you mean by this? It doesn't look in the example data as if anything like that happened. – Karl Knechtel Jul 01 '22 at 07:20
  • Sounds fuzzywuzzy to me. ; ) –  Jul 01 '22 at 07:21
  • @Rina, see if this is what you need https://stackoverflow.com/a/72825908/16836078 –  Jul 01 '22 at 07:46

2 Answers


You need to apply stemming to both lists. There are discrepancies, for example 'interesting' vs 'interest', and if you apply stemming only to words_list then 'sentence' becomes 'sentenc'. Therefore, apply the stemmer to both lists and then find the common elements:

from nltk.stem import PorterStemmer

text_list = [['i', 'am', 'interest', 'for', 'this', 'subject'], ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']

ps = PorterStemmer()
# stem both lists so that e.g. 'interesting' and 'interest' compare equal
words_list = [ps.stem(w) for w in words_list]
text_list = [list(map(ps.stem, i)) for i in text_list]

answer = []
for i in text_list:
    # common elements between the stemmed words_list and this sentence
    answer.append(list(set(words_list).intersection(set(i))))

# flatten the per-sentence result lists into one list
output = sum(answer, [])
print(output)

>>> ['interest', 'a', 'sentenc']
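Note that the expected output in the question keeps the text's own spelling ('sentence', not the truncated stem 'sentenc'). One way to get that is to map each stem back to the word as it appears in the text. The sketch below uses a hypothetical `crude_stem` stand-in (it only strips an "-ing" suffix) so it runs without nltk; swap in `ps.stem` if nltk is available:

```python
# crude_stem is a hypothetical stand-in for PorterStemmer.stem,
# used here only to keep the sketch dependency-free
def crude_stem(w):
    return w[:-3] if w.endswith('ing') and len(w) > 5 else w

text_list = [['i', 'am', 'interest', 'for', 'this', 'subject'],
             ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']

# map each stem to the word as it appears in the text, so the output keeps
# the text's spelling ('interest', 'sentence') instead of truncated stems
stem_to_text_word = {}
for sentence in text_list:
    for w in sentence:
        stem_to_text_word.setdefault(crude_stem(w), w)

stemmed_words = {crude_stem(w) for w in words_list}
same_words = [w for s, w in stem_to_text_word.items() if s in stemmed_words]
print(same_words)  # ['interest', 'a', 'sentence']
```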
darth baba

There is a package called fuzzywuzzy which lets you approximately match strings from one list against strings from another.

First of all, you will need to flatten your nested list into a set of unique strings.

from itertools import chain
newset = set(chain(*text_list))

{'sentence', 'i', 'interest', 'am', 'is', 'for', 'a', 'second', 'subject', 'this'}

Next, from the fuzzywuzzy package, we import the fuzz module.

from fuzzywuzzy import fuzz

result = [max([(fuzz.token_set_ratio(i,j),j) for j in newset]) for i in words_list]

[(100, 'a'), (57, 'for'), (100, 'sentence'), (84, 'interest')]

As shown above, fuzz.token_set_ratio matches every element from words_list against all the elements in newset and gives the percentage of matching characters between the two. You can remove the max to see the full list of scores. (Some of the letters in 'for' also appear in 'word', which is why it shows up in this tuple list with a 57% match. You can later use a for loop and a percentage tolerance to remove matches below that tolerance.)
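The percentage-tolerance filtering just mentioned can be sketched as follows. To keep the sketch runnable without installing fuzzywuzzy, it uses the standard library's difflib.SequenceMatcher as a stand-in scorer (the scores happen to be close to token_set_ratio on this data), and the threshold of 80 is an arbitrary choice for illustration:

```python
from difflib import SequenceMatcher

newset = {'sentence', 'i', 'interest', 'am', 'is', 'for', 'a',
          'second', 'subject', 'this'}
words_list = ['a', 'word', 'sentence', 'interesting']

def ratio(a, b):
    # similarity as a 0-100 score, roughly comparable to fuzzywuzzy's
    return round(SequenceMatcher(None, a, b).ratio() * 100)

threshold = 80  # arbitrary tolerance chosen for illustration
best = [max((ratio(i, j), j) for j in newset) for i in words_list]
kept = [word for score, word in best if score >= threshold]
print(kept)  # weak matches such as 'word' -> 'for' are dropped
```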

Finally, you will use map to get your desired output.

similarity_score, fuzzy_match = map(list,zip(*result))

fuzzy_match
Out[40]: ['a', 'for', 'sentence', 'interest']

Extra

If your input is not standard ASCII, you can pass another argument, force_ascii=False, to fuzz.token_set_ratio.

a = ['У', 'вас', 'є', 'чашка', 'кави?']

b = ['ви']

[max([(fuzz.token_set_ratio(i, j, force_ascii= False),j) for j in a]) for i in b]
Out[9]: [(67, 'кави?')]
  • `for` is not in the expected output. – darth baba Jul 01 '22 at 09:14
  • Hey, this approach looks very interesting. I tried the code with result and unfortunately it has been running for an hour in the Jupyter notebook. I am just a beginner in programming, but maybe it is possible to check whether this code has an infinite loop? – Rina Jul 01 '22 at 09:23
  • @darthbaba yes, but I am just showing the usage of the `fuzzywuzzy`. For `for`, we can see that it has a low percentage of matches, so we can add another loop to remove low-percentage matching words. –  Jul 01 '22 at 11:24
  • @Rina how many words do you have in your original lists? –  Jul 01 '22 at 11:25
  • 4679 words in the words_list – Rina Jul 01 '22 at 12:30
  • Try swapping newset and words_list when doing the list tuple comprehension –  Jul 01 '22 at 13:36
  • @Rina I have tried with 10000 words, and it still runs very fast. Perhaps try it in Python and see how it goes –  Jul 01 '22 at 14:10
  • Is it language sensitive? Because I am working with the Ukrainian alphabet – Rina Jul 01 '22 at 15:45
  • Apparently not all alphabets are supported in this package. However, you can add another argument - `force_ascii = False`. See if it works for you –  Jul 01 '22 at 16:02
  • Where in the code should I place it? – Rina Jul 01 '22 at 17:06
  • I have updated the answer, please check it. Thank you –  Jul 01 '22 at 23:57