If you have strings in the column, then you have to split every sentence into a list of words and then put all of those lists into one list - you can use sum() for this - it should give you all the words. To get unique words you can convert that list to set() - and later you can convert it back to list().

But at the start you have to clean the sentences to remove characters like ".", ",", "?", etc. I use a regex to keep only letters and spaces. Eventually you may also want to convert all words to lower or upper case.
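For a single sentence, the cleaning step can be sketched with the standard re module like this:

```python
import re

sentence = 'is so sad for my apl friend.'

# keep only letters and spaces, then lower-case and split into words
cleaned = re.sub('[^a-zA-Z ]', '', sentence).lower()
words = cleaned.split(' ')

print(words)  # ['is', 'so', 'sad', 'for', 'my', 'apl', 'friend']
```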
import pandas as pd
df = pd.DataFrame({
    'sentences': [
        'is so sad for my apl friend.',
        'omg this is terrible.',
        'what is this new song?',
    ]
})
unique = set(df['sentences'].str.replace('[^a-zA-Z ]', '', regex=True).str.lower().str.split(' ').sum())
print(list(sorted(unique)))
Result
['apl', 'for', 'friend', 'is', 'my', 'new', 'omg', 'sad', 'so', 'song', 'terrible', 'this', 'what']
EDIT: As @HenryYik mentioned in a comment, findall(r'\w+') can be used instead of split() - and it also replaces replace():

unique = set(df['sentences'].str.lower().str.findall(r'\w+').sum())
EDIT: I tested it with data from
http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
All works fast except column.sum() or sum(column) - I measured the time for 1000 rows and extrapolated to 1 500 000 rows, and it would need about 35 minutes.

Much faster is to use itertools.chain() - it needs only about 8 seconds.
import itertools

words = df['sentences'].str.lower().str.findall(r'\w+')
words = list(itertools.chain.from_iterable(words))
unique = set(words)
But the words can also be added to a set() directly, without building the intermediate list:
words = df['sentences'].str.lower().str.findall(r'\w+')

unique = set()
for x in words:
    unique.update(x)
and this takes only about 5 seconds.
Full code:
import pandas as pd
import time
print(time.strftime('%H:%M:%S'), 'start')
print('-----')
#------------------------------------------------------------------------------
start = time.time()
# `read_csv()` can read directly from the internet, including zip-compressed files
#url = 'http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip'
url = 'SentimentAnalysisDataset.csv'
# need to skip two rows which are incorrect
df = pd.read_csv(url, sep=',', dtype={'ItemID':int, 'Sentiment':int, 'SentimentSource':str, 'SentimentText':str}, skiprows=[8835, 535881])
end = time.time()
print(time.strftime('%H:%M:%S'), 'load:', end-start, 's')
print('-----')
#------------------------------------------------------------------------------
start = end
words = df['SentimentText'].str.lower().str.findall(r'\w+')
#df['words'] = words
end = time.time()
print(time.strftime('%H:%M:%S'), 'words:', end-start, 's')
print('-----')
#------------------------------------------------------------------------------
start = end
unique = set()
for x in words:
    unique.update(x)
end = time.time()
print(time.strftime('%H:%M:%S'), 'set:', end-start, 's')
print('-----')
#------------------------------------------------------------------------------
print(list(sorted(unique))[:10])
Result
00:27:04 start
-----
00:27:08 load: 4.10780930519104 s
-----
00:27:23 words: 14.803470849990845 s
-----
00:27:27 set: 4.338541269302368 s
-----
['0', '00', '000', '0000', '00000', '000000000000', '0000001', '000001', '000014', '00004873337e0033fea60']