I am fairly new to Python and I work mainly with imported text files, especially CSVs, which give me headaches to process. I tried to read docs like this one: https://docs.python.org/2/howto/unicode.html but I can't make sense of what is being said. I just want a straightforward, down-to-earth explanation.
For instance, I want to tokenize a large number of verbatims exported from the internet as a CSV file, and I want to use NLTK's tokenizer to do so.
Here's my code:
import csv
import nltk

with open('verbatim.csv', 'r') as csvfile:
    reader = unicode_csv_reader(csvfile, dialect=csv.excel)
    for data in reader:
        tokens = nltk.word_tokenize(data)
When I print data, I get clean text.
But when I call the tokenizer, it raises the following error:
'ascii' codec can't decode byte 0xe9 in position 31: ordinal not in range(128)
It looks like an encoding problem, and I run into the same kind of problem with every little manipulation I do on text. Can you help me with this?
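In case it helps, here is a minimal sketch that seems to produce the same kind of message outside of my script. The byte string is a made-up sample (not taken from my file), the helper name try_decode is hypothetical, and I am only assuming that the 0xe9 byte is an "é" stored in a Latin-1 style encoding; I have not confirmed what encoding the export actually uses.

```python
# Hypothetical sample, not from the real file: "café" encoded as Latin-1,
# where "é" is stored as the single byte 0xe9.
raw = b"caf\xe9"


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return str(exc)


# ASCII only covers bytes 0-127, so 0xe9 cannot be decoded with it.
ascii_result = try_decode(raw, "ascii")

# Assumption: the file is Latin-1; with that codec 0xe9 maps to "é".
latin1_result = try_decode(raw, "latin-1")

print(ascii_result)
print(latin1_result)
```

The first call gives a message of the same shape as the one I get from the tokenizer, while the second one decodes without complaint, so my guess is that something in my pipeline falls back to the ASCII codec on non-ASCII bytes.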