Informative features are not returning Cyrillic characters

Question

I've switched to Python 3.6, but when I show the most informative features I get gibberish where the Russian text from my feature extractor should be.

Most Informative Features
  three_last_letters = 'Ð¾Ì'            noun : verb   =      6.6 : 1.0
  three_last_letters = 'Ð³Ð'            noun : verb   =      5.4 : 1.0
  three_last_letters = 'ÐµÐ'            noun : verb   =      4.7 : 1.0
  three_last_letters = 'Ð¼Ð'            noun : verb   =      4.4 : 1.0
  three_last_letters = 'Ð½Ñ'            noun : verb   =      3.5 : 1.0

Here is the feature extractor itself:

def POS_features(word):
    # Use the last three characters of the word as the only feature.
    return {'three_last_letters': word[-3:]}
print(POS_features(u'Богатир'))

With it, I can get тир to print just fine. Is there something I can do to make the informative features return Russian characters as well?

If this wasn't the case with Python 3.5, it could be because of this change: "PEP 528 and PEP 529, Windows filesystem and console encoding changed to UTF-8.". Sorry, don't have a proper solution but try to experiment with `sys.setdefaultencoding` and check `sys.stdout.encoding`. — drdaeman, May 08 '17 at 14:56
Could you upload your training data sample or a pickle of your model somewhere so that we can download and help you debug? — alvas, May 08 '17 at 15:22
@alvas version 4.3.1, also I'm still relatively new to coding and stackoverflow, what do you mean by a pickle? — reivermello, May 08 '17 at 15:37
@reivermello: [`pickle` is a serialization library](https://docs.python.org/3/library/pickle.html). — 9000, May 08 '17 at 15:51
What encoding are your input files? Is your mangled output from a Jupyter notebook, or the Windows command line? — 9000, May 08 '17 at 15:54
@reivermello Instead of writing the solution into the question, write it as an actual answer below. That way you can mark something as accepted and close the thread. — Tomalak, May 08 '17 at 18:53

score 3 · Accepted Answer · answered May 08 '17 at 20:30

I figured out what I'd done wrong:

import nltk
vocab = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    "C:\\Users\\Admin\\AppData\\Roaming\\nltk_data\\corpora\\russian\\vocab",
    r'.*\.txt', cat_pattern=r'^(noun|verb)', encoding="utf8")

When I'd imported my vocab folder, I had set the encoding to latin-1. With the encoding set to utf8, as above, all is well and Cyrillic characters are returned:

 Most Informative Features
      three_last_letters = 'ать'            verb : noun   =     15.2 : 1.0
      three_last_letters = 'де'             noun : verb   =      2.6 : 1.0
      three_last_letters = 'сть'            noun : verb   =      1.5 : 1.0
      three_last_letters = 'пра'            noun : verb   =      1.4 : 1.0
      three_last_letters = 'ина'            noun : verb   =      1.4 : 1.0
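
For anyone who wants to see the symptom in isolation, here is a minimal sketch that does not need NLTK or the corpus. It assumes the files on disk are UTF-8 and simply decodes those bytes as latin-1, which is what the wrong `encoding` argument amounts to. The names `original`, `garbled`, `recovered` and `last_three` are made up for this illustration.

```python
# Sketch only: reproduce the mojibake by decoding UTF-8 bytes as latin-1.
original = 'ать'

# What is stored on disk (assumed here to be UTF-8).
utf8_bytes = original.encode('utf-8')

# Reading those bytes with the wrong codec turns every byte into one character.
garbled = utf8_bytes.decode('latin-1')

# Reversing the wrong step and decoding correctly restores the text.
recovered = garbled.encode('latin-1').decode('utf-8')


def last_three(word):
    # Same slicing as the feature extractor in the question.
    return word[-3:]


print(len(original), len(garbled))   # three letters become six characters
print(repr(last_three(garbled)))     # slices through the middle of a letter
print(last_three(recovered))
```

Each Cyrillic letter takes two bytes in UTF-8, so under latin-1 the "last three letters" are really the last three bytes. That is why the features in the question look cut off as well as garbled.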

note that you can use raw strings to write Windows paths (or any other string that contains backslashes) in Python source code. `r'C:\path\to\file'`. See http://stackoverflow.com/questions/2081640/what-exactly-do-u-and-r-string-flags-do-in-python-and-what-are-raw-string-l — Tomalak, May 09 '17 at 08:43
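
A quick sketch of the raw-string point, using a shortened version of the path from the answer purely as an illustration: the raw literal and the doubled-backslash literal produce the same string.

```python
# Sketch only: a raw string and an escaped string spell the same Windows path.
raw_path = r'C:\Users\Admin\AppData\Roaming\nltk_data'
escaped_path = 'C:\\Users\\Admin\\AppData\\Roaming\\nltk_data'

print(raw_path == escaped_path)
print(raw_path)
```

The raw form matters most for sequences such as `\n` or `\U`, which a normal string literal would try to interpret as escapes.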
