
I'm trying to tokenize a CSV file of scraped tweets. I loaded the CSV file into a list:

import csv

with open('recent_tweet_purex.csv', 'r') as purex:
    reader_purex = csv.reader(purex)
    purex_list = list(reader_purex)

Now the tweets are in nested lists, like this:

["b'I miss having someone to talk to all night..'"],
["b'Pergunte-me qualquer coisa'"],
["b'RT @Caracolinhos13: Tenho a tl cheia dessa merda de quem vos visitou nas \\xc3\\xbaltimas horas'"],
["b'RT @B24pt: #CarlosHadADream'"],
['b\'"Tudo tem um fim"\''],
["b'RT @thechgama: stalkear as curtidas \\xc3\\xa9 um caminho sem volta'"],
["b'Como consegues fumar 3 purexs seguidas? \\xe2\\x80\\x94 Eram 2 purex e mix...'"]

I have nltk imported, along with the following packages:

 import nltk
 from nltk.tokenize import word_tokenize
 import string
 from nltk.corpus import stopwords
 from nltk.stem import WordNetLemmatizer
 from nltk.tokenize import sent_tokenize
 nltk.download('punkt')

I tried using

 purex_words = word_tokenize(purex_list)

to tokenize, but I keep getting errors.

Any help?

CJ090

1 Answer


You are passing lists to the word_tokenize function, but it expects a string or bytes-like object. If you feed it strings it will work. Quick example:

purex_words = [['I miss having someone to talk to all night..'],
               ['Pergunte-me qualquer coisa'],
               ['RT @Caracolinhos13: Tenho a tl cheia dessa merda de quem vos visitou nas \xc3\xbaltimas horas'],
               ['RT @B24pt: #CarlosHadADream'],
               ["Tudo tem um fim"],
               ["RT @thechgama: stalkear as curtidas \xc3\xa9 um caminho sem volta"],
               ['Como consegues fumar 3 purexs seguidas? \xe2\x80\x94 Eram 2 purex e mix...']]

for sentence in purex_words:
    print(word_tokenize(sentence[0])) # this looks ugly to me

You could flatten the list before looping over the sentences. Note that I wrapped your lists in an outer [].

flat_list = [item for sublist in purex_words for item in sublist]
for sentence in flat_list:
    print(word_tokenize(sentence))

The result looks something like this:

['I', 'miss', 'having', 'someone', 'to', 'talk', 'to', 'all', 'night..']
['Pergunte-me', 'qualquer', 'coisa']
['RT', '@', 'Caracolinhos13', ':', 'Tenho', 'a', 'tl', 'cheia', 'dessa', 'merda', 'de', 'quem', 'vos', 'visitou', 'nas', '\\xc3\\xbaltimas', 'horas']
['RT', '@', 'B24pt', ':', '#', 'CarlosHadADream']
['Tudo', 'tem', 'um', 'fim']
['RT', '@', 'thechgama', ':', 'stalkear', 'as', 'curtidas', '\\xc3\\xa9', 'um', 'caminho', 'sem', 'volta']
['Como', 'consegues', 'fumar', '3', 'purexs', 'seguidas', '?', '\\xe2\\x80\\x94', 'Eram', '2', 'purex', 'e', 'mix', '...']
Guiem Bosch
  • That's progress, thanks. The tweets are now all in their own list with each word tokenized how would I aggregate all those tokenized words into one list? – CJ090 Feb 08 '18 at 05:41
  • Also, I realized you have some extra characters that might be not necessary, maybe because you were trying to avoid the error? I'm referring to the excess of quote symbols `["b'Pergunte-me qualquer coisa'"]`. The previous could be simplified as `['Pergunte-me qualquer coisa']`. No need for `"b`. – Guiem Bosch Feb 08 '18 at 05:48
  • The last issue is how to combine those lists into one list without the "b – CJ090 Feb 08 '18 at 06:22
  • what do you mean? In the `flat_list` for example? It's redundant, but you could do `flat_list = [str(item) for sublist in purex_words for item in sublist]` . This gets rid of `b`. Which Python version are you using btw, 2 or 3? For further information you can check [this](https://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal), but that wouldn't be in the scope of the original question =) – Guiem Bosch Feb 08 '18 at 12:53
  • Oh, I see in your specific case you have nested quotes in quotes, something like this `"b'` at the beginning and `' "` in the end. I'm sure there is a better way to do this, but a quick way to get rid of it, given that it's always the same format, to substring the part you want to get `flat_list = [item[2:-1] for sublist in purex_words for item in sublist] `. But it's not very elegant, I would definitely check the way you obtain the text, the way you store it and the way you read it again from your `.csv`! – Guiem Bosch Feb 08 '18 at 13:05
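Building on the last comment: rather than slicing off characters with `item[2:-1]`, one option is to parse the stored `"b'...'"` string back into a real bytes object with `ast.literal_eval` and then decode it, which also recovers the accented characters hidden in the `\xc3\xba`-style escapes. A sketch, assuming every field in the CSV has that exact `b'...'` format:

```python
import ast

# A field as it comes out of the CSV: the str() of a bytes literal
raw = "b'RT @thechgama: stalkear as curtidas \\xc3\\xa9 um caminho sem volta'"

# Parse the string back into bytes, then decode the UTF-8 bytes to text
tweet = ast.literal_eval(raw).decode('utf-8')

print(tweet)  # RT @thechgama: stalkear as curtidas é um caminho sem volta
```

This only works if the fields are well-formed bytes literals; the more robust fix is still to decode the tweets before writing them to the `.csv` in the first place.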