I have tried the solutions from several already-answered questions on this subject, but my code still raises the same error. The only purpose of this code is to POS-tag the sentences of a document and write to a file the sentences that contain at least N occurrences of a POS tag of your choice:

import os
import nlpnet
import codecs

TAGGER = nlpnet.POSTagger('pos-pt', language='pt')


# You could have a function that tagged and verified if a
# sentence meets the criteria for storage.
def is_worth_saving(text, pos, pos_count):
    # tagged sentences are lists of tagged words, which in
    # nlpnet are (word, pos) tuples. Tagged texts may contain
    # several sentences.
    pos_words = [word for sentence in TAGGER.tag(text)
                 for word in sentence
                 if word[1] == pos]
    return len(pos_words) >= pos_count


with codecs.open('dataset.txt', encoding='utf8') as original_file:
    with codecs.open('dataset_new.txt', 'w') as output_file:
        for text in original_file:
            # For example, only save sentences with 5 or more verbs in them
            if is_worth_saving(text, 'V', 5):
                output_file.write(text + os.linesep)

The error it raises:

Traceback (most recent call last):
  File "D:/Word Sorter/Classifier.py", line 31, in <module>
    output_file.write(text + os.linesep)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 161-162: ordinal not in range(128)
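
One thing that stands out in the traceback is that it points at the write, and the output file is the only one opened without an encoding argument. Assuming this is Python 2 (which the 'ascii' codec message suggests), the decoded text read from dataset.txt would then be implicitly encoded as ASCII on write, which fails on accented characters. Below is a minimal sketch of opening both files with an explicit UTF-8 encoding. The helper copy_matching_lines and its keep argument are hypothetical names standing in for the loop and for is_worth_saving, the sample sentence is made up, and io.open is used here in place of codecs.open, so it is an illustration rather than a confirmed fix:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def copy_matching_lines(src_path, dst_path, keep):
    """Copy the lines for which keep(line) is true.

    Hypothetical helper: `keep` stands in for is_worth_saving.
    Both files are opened with an explicit UTF-8 encoding, so no
    implicit ASCII encoding happens on write.
    """
    with io.open(src_path, encoding='utf8') as src:
        with io.open(dst_path, 'w', encoding='utf8') as dst:
            for line in src:
                if keep(line):
                    # Each line read from the file already ends with a
                    # newline, so nothing extra is appended here.
                    dst.write(line)


# Small self-contained demo with made-up data in a temporary directory.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'dataset.txt')
dst_path = os.path.join(tmp_dir, 'dataset_new.txt')

with io.open(src_path, 'w', encoding='utf8') as f:
    f.write(u'Ele não sabia o que fazer.\nShort line.\n')

# Keep only the lines that contain a non-ASCII character.
copy_matching_lines(src_path, dst_path, lambda line: u'ã' in line)

with io.open(dst_path, encoding='utf8') as f:
    result = f.read()
```

In the original script the smallest equivalent change would presumably be to pass encoding='utf8' to the second codecs.open call as well. Note also that text + os.linesep adds a second line ending after the one each line already carries, so the output would contain blank lines between sentences.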