
I have a CSV which looks like the one below:

    Description
0   ['boy']
1   ['boy', 'jumped', 'roof']
2   ['paris']
3   ['paris', 'beautiful', 'new', 'york']
4   ['lets', 'go', 'party']
5   ['refused', 'come', 'party']

I need to find the unique words from this data, so the output would be:

    Unique Words
0   boy
1   jumped
2   roof
3   paris
4   beautiful
5   new
6   york

and so on. I am trying to do this using Pandas and Python but am unable to achieve it. My code is:

import pandas as pd

df = pd.read_csv('output.csv')
list(set(df.Description))
g = list(df['Description'].unique())
print(g)

This gives the wrong output; it just returns the original CSV DataFrame.

  • I would suspect that your DataFrame's column contains either lists whose elements are those words, or those lists in string format, i.e. one long string with all the words per row. Either way, `unique` would operate on those lists/full strings instead of the individual words. – Khris Sep 08 '16 at 07:21
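
A quick check along these lines (a minimal sketch, assuming the CSV column really holds string representations of lists):

import pandas as pd

df = pd.read_csv('output.csv')

# Each cell is one string such as "['boy', 'jumped', 'roof']",
# so unique() deduplicates whole strings, not individual words.
print(type(df['Description'].iloc[0]))  # <class 'str'>
print(df['Description'].unique())       # the original rows, not words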

2 Answers


You first need to convert the string column to lists; I use `ast.literal_eval`. Then flatten the lists with a list comprehension, get the unique words with `set`, and finally create a new DataFrame with the constructor:

import ast

print(type(df.loc[0, 'Description']))
<class 'str'>

df.Description = df.Description.apply(ast.literal_eval)

print(type(df.loc[0, 'Description']))
<class 'list'>

#http://stackoverflow.com/q/952914/2901002
unique_data = list(set(item for sublist in df.Description.tolist() for item in sublist))
print(unique_data)
['refused', 'jumped', 'go', 'roof', 'come', 'beautiful',
 'paris', 'york', 'lets', 'new', 'boy', 'party']

print(pd.DataFrame({'Unique Words': unique_data}))
   Unique Words
0       refused
1        jumped
2            go
3          roof
4          come
5     beautiful
6         paris
7          york
8          lets
9           new
10          boy
11        party
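
As an aside (not part of the original answer), once the column holds real lists the flattening can also be done with itertools.chain, which avoids building an intermediate nested list:

from itertools import chain

# chain.from_iterable concatenates all the row lists lazily;
# set() then removes the duplicates in one pass.
unique_data = list(set(chain.from_iterable(df.Description)))
print(pd.DataFrame({'Unique Words': unique_data}))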

Another solution, without `ast`:

df.Description = df.Description.str.strip('[]').str.split(',')
print(df)
                                Description
0                                   ['boy']
1               ['boy',  'jumped',  'roof']
2                                 ['paris']
3  ['paris',  'beautiful',  'new',  'york']
4                 ['lets',  'go',  'party']
5            ['refused',  'come',  'party']

unique_data = list(set(item.strip().strip("'") for sublist in df.Description.tolist() for item in sublist))
print(unique_data)
['refused', 'jumped', 'go', 'roof', 'come', 'beautiful', 
'paris', 'york', 'lets', 'new', 'boy', 'party']

print(pd.DataFrame({'Unique Words': unique_data}))
   Unique Words
0       refused
1        jumped
2            go
3          roof
4          come
5     beautiful
6         paris
7          york
8          lets
9           new
10          boy
11        party
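
If the raw strings always follow this quoted, comma-separated pattern, a regular expression can pull the words out directly (a sketch, not part of the original answer; it assumes a fresh read of the CSV, since the column above has already been split):

raw = pd.read_csv('output.csv')

# str.findall collects every single-quoted token per row;
# the pattern assumes the words themselves contain no quotes.
words = raw.Description.str.findall(r"'([^']+)'")
unique_data = list(set(w for row in words for w in row))
print(unique_data)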
jezrael

This approach works:

import pandas as pd
import ast

test = {'Description': ["['boy']", "['boy', 'jumped', 'roof']", "['paris']",
                        "['paris', 'beautiful', 'new', 'york']", "['lets', 'go', 'party']",
                        "['refused', 'come', 'party']"]}

tt = pd.DataFrame(test)

listOfWords = []
for i, row in tt.iterrows():
    listOfWords.extend(ast.literal_eval(row['Description']))
uniqueWords = pd.DataFrame(listOfWords, columns=['Unique Words']).drop_duplicates()

If you want it sorted:

uniqueWords = uniqueWords.sort_values('Unique Words')

You iterate over all rows, convert the strings to lists, and gather all those lists into one long list with `extend`. Then just make a new DataFrame from that list and drop the duplicates.
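
One detail worth noting: `drop_duplicates` keeps the original row labels, so the index will have gaps. A small addition (not in the original answer) renumbers it to match the expected output in the question:

# reset_index(drop=True) renumbers the rows 0..N-1
# instead of keeping the pre-deduplication labels.
uniqueWords = uniqueWords.reset_index(drop=True)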

EDIT: Thanks to jezrael for correcting my solution; I borrowed the `ast.literal_eval` approach from his answer.

I tried to compare our solutions using the `%timeit` command but got `ValueError: malformed string` on `ast.literal_eval` in both solutions.

EDIT2: jezrael's solution is twice as fast for the small data example we have here.

EDIT3: I can't test with a large data example (multiplying the given one by some factor) because `timeit` keeps throwing malformed-string errors for reasons unclear to me.

EDIT4: Made it work somehow. For a larger dataset (6,000 rows) jezrael's solution is over 8 times faster. I guess even iterating with `iterrows` is rather slow compared to list comprehensions. I also tested jezrael's second solution without `ast`; it's more than twice as fast as his first.
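
For reference, a benchmarking sketch along the lines described above (the function names are made up here; the larger frame is built by simply repeating the six example rows from `test`):

import ast
import timeit

import pandas as pd

big = pd.DataFrame({'Description': test['Description'] * 1000})  # 6000 rows

def flat_comprehension(df):
    # jezrael's approach: literal_eval, then flatten with a comprehension
    desc = df['Description'].apply(ast.literal_eval)
    return list(set(item for sublist in desc for item in sublist))

def iterrows_extend(df):
    # the iterrows/extend approach from this answer
    words = []
    for _, row in df.iterrows():
        words.extend(ast.literal_eval(row['Description']))
    return list(set(words))

print(timeit.timeit(lambda: flat_comprehension(big), number=10))
print(timeit.timeit(lambda: iterrows_extend(big), number=10))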

Khris