I've switched to Python 3.6 now, but when I show the most informative features I get gibberish instead of the Russian text produced by my feature extractor.

Most Informative Features
  three_last_letters = 'оÌ'            noun : verb   =      6.6 : 1.0
  three_last_letters = 'гÐ'            noun : verb   =      5.4 : 1.0
  three_last_letters = 'еÐ'            noun : verb   =      4.7 : 1.0
  three_last_letters = 'мÐ'            noun : verb   =      4.4 : 1.0
  three_last_letters = 'нÑ'            noun : verb   =      3.5 : 1.0

In the case of the feature extractor itself

def POS_features(word):
    # Use the last three characters of the word as the only feature.
    return {'three_last_letters': word[-3:]}

print(POS_features(u'Богатир'))

I can get тир to print just fine. Is there something I can do to make the informative features show Russian characters?

  • On what OS? And do you run the print in a console or an IDE? – dima May 08 '17 at 14:31
  • I code in a Jupyter notebook on Windows. – reivermello May 08 '17 at 14:35
  • If this wasn't the case with Python 3.5, it could be because of this change: "PEP 528 and PEP 529, Windows filesystem and console encoding changed to UTF-8.". Sorry, don't have a proper solution but try to experiment with `sys.setdefaultencoding` and check `sys.stdout.encoding`. – drdaeman May 08 '17 at 14:56
  • What is your Jupyter notebook version? – alvas May 08 '17 at 15:17
  • Could you upload your training data sample or a pickle of your model somewhere so that we can download and help you debug? – alvas May 08 '17 at 15:22
  • @alvas version 4.3.1, also I'm still relatively new to coding and stackoverflow, what do you mean by a pickle? – reivermello May 08 '17 at 15:37
  • @reivermello: [`pickle` is a serialization library](https://docs.python.org/3/library/pickle.html). – 9000 May 08 '17 at 15:51
  • What encoding are your input files? Is your mangled output from a Jupyter notebook, or the Windows command line? – 9000 May 08 '17 at 15:54
  • @reivermello Instead of writing the solution into the question, write it as an actual answer below. That way you can mark something as accepted and close the thread. – Tomalak May 08 '17 at 18:53

1 Answer

I figured out what I'd done wrong. This is the corrected call:

vocab = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    "C:\\Users\\Admin\\AppData\\Roaming\\nltk_data\\corpora\\russian\\vocab",
    r'.*\.txt', cat_pattern=r'^(noun|verb)', encoding="utf8")

When I'd imported my vocab folder, I had set the encoding to latin-1. With encoding="utf8" all is well, and Cyrillic characters are returned for me:

 Most Informative Features
      three_last_letters = 'ать'            verb : noun   =     15.2 : 1.0
      three_last_letters = 'де'             noun : verb   =      2.6 : 1.0
      three_last_letters = 'сть'            noun : verb   =      1.5 : 1.0
      three_last_letters = 'пра'            noun : verb   =      1.4 : 1.0
      three_last_letters = 'ина'            noun : verb   =      1.4 : 1.0
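
For anyone wondering why the wrong `encoding` argument produces this kind of gibberish, here is a minimal sketch that does not use NLTK at all. It assumes the vocab files are stored as UTF-8 (which the fix above implies); the variable names are only for illustration.

```python
# Sketch: reproduce the mangling by decoding UTF-8 bytes as latin-1.
word = 'Богатир'

# What is actually stored on disk if the file was saved as UTF-8.
utf8_bytes = word.encode('utf-8')

# Reading those bytes as latin-1 never raises an error, because every
# byte value maps to some latin-1 character. Each Cyrillic letter
# (two bytes in UTF-8) turns into two unrelated characters.
mangled = utf8_bytes.decode('latin-1')

# Reading the same bytes as UTF-8 gives the original text back.
correct = utf8_bytes.decode('utf-8')

print(repr(mangled[-3:]))   # three junk characters, not 'тир'
print(correct[-3:])         # the real last three letters

# The mangling is reversible as long as no bytes were lost.
repaired = mangled.encode('latin-1').decode('utf-8')
print(repaired == word)
```

Because the mangled string has twice as many characters as the original, `word[-3:]` in the feature extractor also cuts in the middle of a letter, which is why the features looked broken even though the extractor itself was fine.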
  • Note that you can use raw strings to write Windows paths (or any other string that contains backslashes) in Python source code. `r'C:\path\to\file'`. See http://stackoverflow.com/questions/2081640/what-exactly-do-u-and-r-string-flags-do-in-python-and-what-are-raw-string-l – Tomalak May 09 '17 at 08:43
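
To illustrate the raw-string suggestion, here is a small sketch using the path from the answer. It only compares the two spellings and does not touch the filesystem.

```python
# Sketch: the same Windows path written with doubled backslashes
# and as a raw string. Both produce an identical str object value.
escaped = "C:\\Users\\Admin\\AppData\\Roaming\\nltk_data\\corpora\\russian\\vocab"
raw = r"C:\Users\Admin\AppData\Roaming\nltk_data\corpora\russian\vocab"

print(escaped == raw)

# Without the r prefix (and without doubling), "\U" in "C:\Users" would
# be read as the start of a Unicode escape and fail with a SyntaxError
# in Python 3, and "\n" in "\nltk_data" would become a newline.
# Caveat: a raw string cannot end with a single backslash.
```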