
I am interested in finding the same words in two lists. I have two lists of words in text_list. I also stemmed the words.

text_list = [['i', 'am', 'interest' ,'for', 'this', 'subject'], ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']

So I need this output:

same_words= ['a', 'sentence', 'interest']
codester_09
Rina
    Does this answer your question? [Common elements comparison between 2 lists](https://stackoverflow.com/questions/2864842/common-elements-comparison-between-2-lists) – sahasrara62 Jul 01 '22 at 06:55
  • Is `"interest"` and `"interesting"` supposed to be considered the same? – Aniketh Malyala Jul 01 '22 at 07:02
  • Yes, it is the same word, but a different grammatical form, that's why I am searching for an approach in Python, that can return 'interest' and 'interesting' like a same word. – Rina Jul 01 '22 at 07:12
  • "I also stemmed the words." What exactly do you mean by this? It doesn't look in the example data as if anything like that happened. – Karl Knechtel Jul 01 '22 at 07:20
  • Sounds fuzzywuzzy to me. ; ) –  Jul 01 '22 at 07:21
  • @Rina, see if this is what you need https://stackoverflow.com/a/72825908/16836078 –  Jul 01 '22 at 07:46

2 Answers


You need to apply stemming to both lists. There are discrepancies, for example 'interesting' vs 'interest', and if you apply stemming only to words_list then 'sentence' becomes 'sentenc'. Therefore, apply the stemmer to both lists and then find the common elements:

from nltk.stem import PorterStemmer

text_list = [['i', 'am', 'interest', 'for', 'this', 'subject'], ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']

ps = PorterStemmer()
# stem both lists so that e.g. 'interesting' and 'interest' compare equal
words_list = [ps.stem(w) for w in words_list]
text_list = [list(map(ps.stem, i)) for i in text_list]

answer = []
for i in text_list:
    # common elements between the stemmed words_list and this sentence
    answer.append(list(set(words_list).intersection(set(i))))

# flatten the per-sentence result lists into one list
output = sum(answer, [])
print(output)

>>> ['interest', 'a', 'sentenc']
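Note that the expected output in the question keeps the text's own spelling ('sentence', not the truncated stem 'sentenc'). One way to get that is to map each stem back to the word as it appears in the text. The sketch below uses a hypothetical `crude_stem` stand-in (it only strips an "-ing" suffix) so it runs without nltk; swap in `ps.stem` if nltk is available:

```python
# crude_stem is a hypothetical stand-in for PorterStemmer.stem,
# used here only to keep the sketch dependency-free
def crude_stem(w):
    return w[:-3] if w.endswith('ing') and len(w) > 5 else w

text_list = [['i', 'am', 'interest', 'for', 'this', 'subject'],
             ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']

# map each stem to the word as it appears in the text, so the output keeps
# the text's spelling ('interest', 'sentence') instead of truncated stems
stem_to_text_word = {}
for sentence in text_list:
    for w in sentence:
        stem_to_text_word.setdefault(crude_stem(w), w)

stemmed_words = {crude_stem(w) for w in words_list}
same_words = [w for s, w in stem_to_text_word.items() if s in stemmed_words]
print(same_words)  # ['interest', 'a', 'sentence']
```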
darth baba

There is a package called fuzzywuzzy which lets you approximately match strings from one list against strings from another.

First of all, you will need to flatten your nested list into a set of unique strings.

from itertools import chain
newset = set(chain(*text_list))

{'sentence', 'i', 'interest', 'am', 'is', 'for', 'a', 'second', 'subject', 'this'}

Next, from the fuzzywuzzy package, we import the fuzz module.

from fuzzywuzzy import fuzz

result = [max([(fuzz.token_set_ratio(i,j),j) for j in newset]) for i in words_list]

[(100, 'a'), (57, 'for'), (100, 'sentence'), (84, 'interest')]

As shown above, fuzz.token_set_ratio matches every element from words_list against all the elements in newset and gives the percentage of matching characters between the two. You can remove the max to see the full list of scores. (Some of the letters in 'for' also appear in 'word', which is why it shows up in this tuple list with a 57% match. You can later use a for loop and a percentage tolerance to remove matches below that tolerance.)
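The percentage-tolerance filtering just mentioned can be sketched as follows. To keep the sketch runnable without installing fuzzywuzzy, it uses the standard library's difflib.SequenceMatcher as a stand-in scorer (the scores happen to be close to token_set_ratio on this data), and the threshold of 80 is an arbitrary choice for illustration:

```python
from difflib import SequenceMatcher

newset = {'sentence', 'i', 'interest', 'am', 'is', 'for', 'a',
          'second', 'subject', 'this'}
words_list = ['a', 'word', 'sentence', 'interesting']

def ratio(a, b):
    # similarity as a 0-100 score, roughly comparable to fuzzywuzzy's
    return round(SequenceMatcher(None, a, b).ratio() * 100)

threshold = 80  # arbitrary tolerance chosen for illustration
best = [max((ratio(i, j), j) for j in newset) for i in words_list]
kept = [word for score, word in best if score >= threshold]
print(kept)  # weak matches such as 'word' -> 'for' are dropped
```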

Finally, you will use map to get your desired output.

similarity_score, fuzzy_match = map(list,zip(*result))

fuzzy_match
Out[40]: ['a', 'for', 'sentence', 'interest']

Extra

If your input is not standard ASCII, you can pass another argument, force_ascii=False, to fuzz.token_set_ratio.

a = ['У', 'вас', 'є', 'чашка', 'кави?']

b = ['ви']

[max([(fuzz.token_set_ratio(i, j, force_ascii= False),j) for j in a]) for i in b]
Out[9]: [(67, 'кави?')]
  • `for` is not in the expected output. – darth baba Jul 01 '22 at 09:14
  • Hey, this approach looks very interesting. I tried the code with result and unfortunately it has been running for an hour in the Jupyter notebook. I am just a beginner in programming, but maybe it is possible to check whether this code has an infinite loop? – Rina Jul 01 '22 at 09:23
  • @darthbaba yes, but I am just showing the usage of the `fuzzywuzzy`. For `for`, we can see that it has a low percentage of matches, so we can add another loop to remove low-percentage matching words. –  Jul 01 '22 at 11:24
  • @Rina how many words do you have in your original lists? –  Jul 01 '22 at 11:25
  • 4679 words in the words_list – Rina Jul 01 '22 at 12:30
  • Try swapping newset and words_list when doing the list tuple comprehension –  Jul 01 '22 at 13:36
  • @Rina I have tried with 10000 words, and it still runs very fast. Perhaps try it in Python and see how it goes –  Jul 01 '22 at 14:10
  • Is it language sensitive? Because I am working with the Ukrainian alphabet – Rina Jul 01 '22 at 15:45
  • Apparently not all alphabets are supported in this package. However, you can add another argument - `force_ascii = False`. See if it works for you –  Jul 01 '22 at 16:02
  • Where in the code should I place it? – Rina Jul 01 '22 at 17:06
  • I have updated the answer, please check it. Thank you –  Jul 01 '22 at 23:57