
I have a simple UTF-8 encoded string. I am stemming it with the NLTK Snowball stemmer, but after stemming the result is a unicode string. How can I convert it back to UTF-8? My code and its output are below.

from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')

string = "something i am writing" 
string_before_Stem = string.split()
print string_before_Stem

['something', 'i', 'am', 'writing']

string = stemmer.stem(string)
string = string.split()
print string 

[u'something', u'i', u'am', u'writ']
Haroon S.

1 Answer


You can call encode on each stemmed token to convert it from unicode back to a UTF-8 byte string:

from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')

string = "something i am writing" 
string_before_Stem = string.split()
print string_before_Stem

['something', 'i', 'am', 'writing']

string = stemmer.stem(string)
string = string.split()

encoded_string = [s.encode('UTF-8', 'strict') for s in string]
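The encoding step on its own can be sketched without NLTK. This is only an illustration: encode_tokens is a hypothetical helper name, and the stemmed list is a hard-coded stand-in for the stemmer output shown in the question. Under Python 2 the encoded items are plain str objects; under Python 3 they are bytes.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def encode_tokens(tokens, encoding='utf-8'):
    """Hypothetical helper: encode each text token to a byte string."""
    # 'strict' raises UnicodeEncodeError instead of silently dropping characters.
    return [token.encode(encoding, 'strict') for token in tokens]


# Stand-in for the stemmer output from the question (assumed, not computed here).
stemmed = [u'something', u'i', u'am', u'writ']

encoded = encode_tokens(stemmed)
print(encoded)

# Decoding with the same codec gives back the original unicode tokens.
decoded = [item.decode('utf-8') for item in encoded]
print(decoded == stemmed)
```

For pure ASCII tokens like these, the encoded bytes are the same characters; the difference only becomes visible with non-ASCII text, where one character may become several bytes.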
Mike Tung