Convert string from unicode to utf-8 after stemming

Question

I have a simple UTF-8 encoded string, and I stem it with NLTK's SnowballStemmer. After stemming, the result is a unicode string. How can I convert it back to UTF-8? My code is below.

from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')

string = "something i am writing" 
string_before_Stem = string.split()
print string_before_Stem

['something', 'i', 'am', 'writing']

string = stemmer.stem(string)
string = string.split()
print string 

[u'something', u'i', u'am', u'writ']

@Henry, OP is using Python 2. In Python 3 `'something'` *is* a Unicode string. — Mark Tolonen, Nov 29 '17 at 03:03
[See here](https://stackoverflow.com/questions/16957226/encode-python-list-to-utf-8) — Henry, Nov 29 '17 at 03:05
Why fight it? Text *should* be Unicode, and in Python 3, Unicode strings are the default. — Mark Tolonen, Nov 29 '17 at 03:07

score 3 · Accepted Answer · answered Nov 29 '17 at 03:04

You can use `encode` to convert each unicode string back to a UTF-8 encoded byte string.

from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')

string = "something i am writing" 
string_before_Stem = string.split()
print string_before_Stem

['something', 'i', 'am', 'writing']

string = stemmer.stem(string)
string = string.split()

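# Encode each unicode token as UTF-8; 'strict' raises an error on characters that cannot be encoded
encoded_string = [s.encode('UTF-8', 'strict') for s in string]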
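
To see the encode step on its own, here is a minimal sketch that does not need NLTK. The `stemmed` list is made-up sample data standing in for the stemmer output (the `caf\u00e9` entry is an assumption, added only to show a non-ASCII character), and the variable names are hypothetical. It should run on both Python 2 and Python 3.

```python
# Stand-in for the stemmer output: a list of unicode strings.
# (Assumed sample data; in the real code this list comes from the stemming step.)
stemmed = [u'something', u'i', u'am', u'writ', u'caf\u00e9']

# Encode each unicode string into a UTF-8 byte string.
encoded = [s.encode('utf-8') for s in stemmed]

# Decoding with the same codec reverses the step and gives back the original text.
decoded = [b.decode('utf-8') for b in encoded]

print(encoded)
print(decoded == stemmed)
```

For plain ASCII tokens the encoded bytes look the same as the text; the difference only shows up for characters such as `é`, which become two bytes in UTF-8.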
