If you have strings in the column, then you have to split every sentence into a list of words and then put all of those lists into one list - you can use sum() for this - it should give you all the words. To get unique words you can convert that list to set() - and later you can convert it back to list().

But at the start you have to clean the sentences to remove characters like ".", ",", "?", etc. I use a regex to keep only letters and spaces. Eventually you may also want to convert all words to lower or upper case.
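For a single sentence, the cleaning step can be sketched with the standard re module like this:

```python
import re

sentence = 'is so sad for my apl friend.'

# keep only letters and spaces, then lower-case and split into words
cleaned = re.sub('[^a-zA-Z ]', '', sentence).lower()
words = cleaned.split(' ')

print(words)  # ['is', 'so', 'sad', 'for', 'my', 'apl', 'friend']
```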
import pandas as pd
df = pd.DataFrame({
    'sentences': [
        'is so sad for my apl friend.',
        'omg this is terrible.',
        'what is this new song?',
    ]
})
unique = set(df['sentences'].str.replace('[^a-zA-Z ]', '', regex=True).str.lower().str.split(' ').sum())
print(list(sorted(unique)))
Result
['apl', 'for', 'friend', 'is', 'my', 'new', 'omg', 'sad', 'so', 'song', 'terrible', 'this', 'what']
EDIT: As @HenryYik mentioned in a comment, findall(r'\w+') can be used instead of split() - and it also replaces replace():

unique = set(df['sentences'].str.lower().str.findall(r'\w+').sum())
EDIT: I tested it with data from
http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
All works fast except column.sum() or sum(column) - I measured the time for 1000 rows and extrapolated to 1 500 000 rows, and it would need about 35 minutes.

Much faster is to use itertools.chain() - it needs only about 8 seconds.
import itertools

words = df['sentences'].str.lower().str.findall(r'\w+')
words = list(itertools.chain.from_iterable(words))
unique = set(words)
But the words can also be added to a set() directly, without building the intermediate list:
words = df['sentences'].str.lower().str.findall(r'\w+')

unique = set()
for x in words:
    unique.update(x)
and this takes only about 5 seconds.
Full code:
import pandas as pd
import time
print(time.strftime('%H:%M:%S'), 'start')
print('-----')
#------------------------------------------------------------------------------
start = time.time()
# `read_csv()` can read directly from the internet, including zip-compressed files
#url = 'http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip'
url = 'SentimentAnalysisDataset.csv'
# need to skip two rows which are incorrect
df = pd.read_csv(url, sep=',', dtype={'ItemID':int, 'Sentiment':int, 'SentimentSource':str, 'SentimentText':str}, skiprows=[8835, 535881])
end = time.time()
print(time.strftime('%H:%M:%S'), 'load:', end-start, 's')
print('-----')
#------------------------------------------------------------------------------
start = end
words = df['SentimentText'].str.lower().str.findall(r'\w+')
#df['words'] = words
end = time.time()
print(time.strftime('%H:%M:%S'), 'words:', end-start, 's')
print('-----')
#------------------------------------------------------------------------------
start = end
unique = set()
for x in words:
    unique.update(x)
end = time.time()
print(time.strftime('%H:%M:%S'), 'set:', end-start, 's')
print('-----')
#------------------------------------------------------------------------------
print(list(sorted(unique))[:10])
Result
00:27:04 start
-----
00:27:08 load: 4.10780930519104 s
-----
00:27:23 words: 14.803470849990845 s
-----
00:27:27 set: 4.338541269302368 s
-----
['0', '00', '000', '0000', '00000', '000000000000', '0000001', '000001', '000014', '00004873337e0033fea60']