I'm trying to tokenize a CSV file of scraped tweets. I loaded the CSV file into a list:
import csv

# read the CSV; each row comes back as a list of column values
with open('recent_tweet_purex.csv', 'r') as purex:
    reader_purex = csv.reader(purex)
    purex_list = list(reader_purex)
Now the tweets are stored in nested lists like this:
["b'I miss having someone to talk to all night..'"], ["b'Pergunte-me
qualquer coisa'"], ["b'RT @Caracolinhos13: Tenho a
tl cheia dessa merda de quem vos visitou nas \\xc3\\xbaltimas horas'"],
["b'RT @B24pt: #CarlosHadADream'"], ['b\'"Tudo tem
um fim"\''], ["b'RT @thechgama: stalkear as curtidas \\xc3\\xa9 um caminho
sem volta'"], ["b'Como consegues fumar 3 purexs seguidas? \\xe2\\x80\\x94
Eram 2 purex e mix...'"]
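Each tweet looks like a one-element list whose only item is the printed form of a bytes object (hence the b'...' prefix and escapes like \xc3\xba), so my guess is they have to be flattened and decoded back to plain strings before tokenizing. This is just a sketch of what I had in mind, assuming ast.literal_eval can turn those "b'...'" strings back into real bytes:

import ast

# assumption: every row is a one-element list holding the repr of a bytes literal
tweets = []
for row in purex_list:
    raw = row[0]                                  # the "b'...'" string
    text = ast.literal_eval(raw).decode('utf-8')  # bytes literal -> bytes -> str
    tweets.append(text)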
I have nltk imported along with the following packages:
from nltk.tokenize import word_tokenize
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
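Since stopwords and WordNetLemmatizer are imported as well, I assume their data would also need downloading (punkt only covers the tokenizers); something like:

nltk.download('stopwords')
nltk.download('wordnet')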
I tried tokenizing with

purex_words = word_tokenize(purex_list)

but I keep getting errors.
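I suspect word_tokenize wants a single string rather than the whole nested list, so what I'm aiming for is probably one token list per tweet, roughly like this (assuming the decoded tweets list from the sketch above):

# word_tokenize expects one string at a time, not a list of lists
purex_words = [word_tokenize(tweet) for tweet in tweets]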
Any help?