Problems using snowballstemmer for a list of Turkish words in Python

Question

I'm trying to use a library called snowballstemmer in Python, but it seems that it's not working as expected. What could the reason be? Please see my code below.

My data set:

df=[['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],['konuda', 'yardımcı', 'oluyorlar', 
   'islemlerimde']]

I installed the snowballstemmer package and imported TurkishStemmer:

  from snowballstemmer import TurkishStemmer
  turkStem=TurkishStemmer()
  data_words_nostops=[turkStem.stemWord(word) for word in df]
  data_words_nostops

  [['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
   ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']]

Unfortunately, it didn't work: the output is identical to the input. But when I apply it to a single word, it works as expected:

 turkStem.stemWord("islemlerimde")
 'islem'

What could be the problem? Any help will be appreciated.

Thank you.

linqo · Accepted Answer · 2020-05-03T03:50:57.853

5

Did you mean to have a list of strings instead of a list of lists containing strings?

I was able to get the stems for each word when I reformatted your code this way:

from snowballstemmer import TurkishStemmer

df = [
    'musteri',
    'hizmetlerine',
    'cabuk',
    'baglaniyorum',
    'konuda',
    'yardımcı',
    'oluyorlar',
    'islemlerimde'
]
turkStem = TurkishStemmer()
data_words_nostops = [turkStem.stemWord(word) for word in df]
print(data_words_nostops)

If you have a list of lists of strings (let's say it's what you've defined as df) and you want to flatten it down to a single list of words, you can do something like this:

df = [
    ['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
    ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']
]
flattened_df = [item for sublist in df for item in sublist]
print(flattened_df)

# Output:
# ['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum', 'konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']

Credit for the above goes to this StackOverflow post.
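As a small sketch of why the original comprehension misbehaved, and of another way to flatten: iterating over df directly yields whole sublists rather than words, and the standard library's itertools.chain.from_iterable produces the same flat list as the comprehension above. The names first_item and flattened are only illustrative.

```python
from itertools import chain

df = [
    ['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
    ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']
]

# Iterating over df directly yields each inner list, not each word,
# which is why "for word in df" handed a list to the stemmer.
first_item = next(iter(df))
print(type(first_item).__name__)  # list

# Flatten one level of nesting into a single list of words.
flattened = list(chain.from_iterable(df))
print(flattened)
```

Note that flattening discards the grouping, so it only fits if you do not need to know which sublist a word came from.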

Alternatively, you could just correct the looping to address the problem with your original layout:

df = [
    ['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
    ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']
]
turkStem = TurkishStemmer()
all_stem_lists = []

for word_group in df:
    output_stems = []
    for word in word_group:
        stem = turkStem.stemWord(word)
        output_stems.append(stem)
    all_stem_lists.append(output_stems)

print(all_stem_lists)
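The same nested loop can also be written as a nested list comprehension, which keeps the list-of-lists shape. Below is a minimal sketch, assuming a hypothetical helper stem_nested that accepts any word-to-stem function. To keep the example runnable without third-party packages it uses a stand-in called toy_stem, which is not a real stemmer; in practice you would pass turkStem.stemWord instead.

```python
from typing import Callable, List


def stem_nested(groups: List[List[str]],
                stem_word: Callable[[str], str]) -> List[List[str]]:
    """Apply stem_word to every word while preserving the nested structure."""
    return [[stem_word(word) for word in group] for group in groups]


def toy_stem(word: str) -> str:
    # Stand-in only: truncates to five characters so the sketch runs anywhere.
    # Replace with TurkishStemmer().stemWord for real stemming.
    return word[:5]


df = [
    ['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
    ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']
]

stemmed = stem_nested(df, toy_stem)
print(stemmed)
```

With the real stemmer the call would look like stem_nested(df, turkStem.stemWord); the outer list still has one entry per original group.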

edited May 03 '20 at 03:50

answered May 03 '20 at 03:17

linqo


Actually, I do have a list, and I need to apply the stemmer to each word in it, but I could not. Thank you for the response. – melik May 03 '20 at 03:24
@melik do you see the difference between my version and yours? In yours you had grouped the words together which had confused the list comprehension. Even though you said "for word in df", "word" in your case meant a list of words instead of a single one. Anyway, please mark the answer as solved if this helped you, and let me know if I can help clarify anything further. – linqo May 03 '20 at 03:26
Actually, I tokenized the words from a text, and I need to apply the stemmer to that list of tokens. How can I do that? Thank you. – melik May 03 '20 at 03:29
@melik I've added some additional clarification on how to flatten your list of lists, or alternatively loop over each nested list of words. Hopefully this should help you figure out what you need to do. – linqo May 03 '20 at 03:38
Thanks, but the final outcome should also be a list of lists, like [ ['musteri', 'hizmet', 'cabuk', 'baglaniyor'], ['konuda', 'yardımcı', 'oluyor', 'islem'] ] – melik May 03 '20 at 03:48
See the last block of code (just updated); it should do what you want now. – linqo May 03 '20 at 03:52
