I have tried the solutions from several already-answered questions on this subject, but my code still raises the same error. The only purpose of this code is to POS-tag the sentences of a document and write to a file the sentences that contain at least N occurrences of a particular POS of your choice:

import os
import nlpnet
import codecs

TAGGER = nlpnet.POSTagger('pos-pt', language='pt')


# You could have a function that tagged and verified if a
# sentence meets the criteria for storage.

def is_worth_saving(text, pos, pos_count):
   # tagged sentences are lists of tagged words, which in
   # nlpnet are (word, pos) tuples. Tagged texts may contain
   # several sentences.
   pos_words = [word for sentence in TAGGER.tag(text)
             for word in sentence
             if word[1] == pos]
   return len(pos_words) >= pos_count



with codecs.open('dataset.txt', encoding='utf8') as original_file:
    with codecs.open('dataset_new.txt', 'w') as output_file:
        for text in original_file:
            # For example, only save sentences with at least 5 verbs in them
            if is_worth_saving(text, 'V', 5):
                output_file.write(text + os.linesep)

The error raised:

Traceback (most recent call last):
   File "D:/Word Sorter/Classifier.py", line 31, in <module>
     output_file.write(text + os.linesep)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 161-162: ordinal not in range(128)
Jeferson S

1 Answer

Have you seen these questions before?

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128) and Again: UnicodeEncodeError: ascii codec can't encode

They describe exactly the same error as yours. My guess is that you need to encode your text with text.encode('utf8') before writing it to the file.
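
To illustrate why encoding helps, here is a minimal sketch. It assumes the traceback comes from Python 2, where writing a unicode string to a file opened without an encoding triggers an implicit conversion with the 'ascii' codec. The sample word is made up for illustration and is not taken from your dataset.

```python
# Hypothetical sample text: "informação" contains two non-ASCII characters.
text = u'informa\u00e7\u00e3o'

try:
    # This is what effectively happens when the text is written to a file
    # that was opened without an encoding (assumption: Python 2 behaviour).
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly to UTF-8 produces bytes that can be written safely.
utf8_bytes = text.encode('utf8')
```

The explicit encode replaces the implicit ASCII conversion, which is the step that fails in your traceback.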

EDIT:

Try using it here:

output_file.write(text.encode('utf8') + os.linesep)
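
Another option, sketched here under the assumption that you are on Python 2.6+ or Python 3, is to open the output file with an explicit encoding so the text is encoded on write and no per-line encode call is needed. The function name copy_matching_lines and the keep argument are hypothetical; keep stands in for your nlpnet-based is_worth_saving check.

```python
import io


def copy_matching_lines(src_path, dst_path, keep):
    """Copy the lines for which keep(line) is true, reading and writing UTF-8.

    Hypothetical helper for illustration; `keep` is any callable that takes
    a line of text and returns True or False.
    """
    # io.open decodes on read and encodes on write, so no implicit
    # ASCII conversion takes place when non-ASCII text is written.
    with io.open(src_path, encoding='utf8') as src:
        with io.open(dst_path, 'w', encoding='utf8') as dst:
            for line in src:
                if keep(line):
                    # Lines read from src normally keep their trailing
                    # newline, so nothing extra is appended here.
                    dst.write(line)
```

In your script, the call could look like copy_matching_lines('dataset.txt', 'dataset_new.txt', lambda t: is_worth_saving(t, 'V', 5)).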
hridayns