My dataframe looks like this:

star_rating  actors_list
0   9.3     [u'Tim Robbins', u'Morgan Freeman']
1   9.2     [u'Marlon Brando', u'Al Pacino', u'James Caan']
2   9.1     [u'Al Pacino', u'Robert De Niro']
3   9.0     [u'Christian Bale', u'Heath Ledger']
4   8.9     [u'John Travolta', u'Uma Thurman']

I want to extract the most frequent names in the actors_list column. I found this code — do you have a better suggestion, especially for big data?

import pandas as pd
df = pd.read_table(r'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv', sep=',')
df.actors_list.str.replace("(u\'|[\[\]]|\')",'').str.lower().str.split(',',expand=True).stack().value_counts()

Expected output (for this data):

robert de niro    13
tom hanks         12
clint eastwood    11
johnny depp       10
al pacino         10
james stewart      9
cs95
Reza energy

4 Answers

By my tests, it would be much faster to do the regex cleanup after counting.

from itertools import chain
import re

p = re.compile("""^u['"](.*)['"]$""")
ser = pd.Series(list(chain.from_iterable(
    x.title().split(', ') for x in df.actors_list.str[1:-1]))).value_counts()
ser.index = [p.sub(r"\1", x) for x in ser.index.tolist()]


ser.head()

Robert De Niro    18
Brad Pitt         14
Clint Eastwood    14
Tom Hanks         14
Al Pacino         13
dtype: int64
cs95
  • Maybe it would be better to remove `from collections import Counter` and add `.title()` to `x.title().split(', ')` – Reza energy Dec 13 '18 at 03:48
It's always better to go with plain Python than to depend on pandas, since pandas consumes a huge amount of memory when the list is large.

If the longest list has 1000 elements, then all the shorter lists will be padded with NaNs when you use expand=True, which is a waste of memory. Try this instead.

df = pd.concat([df]*1000) # For the sake of large df. 

%%timeit
df.actors_list.str.replace("(u\'|[\[\]]|\')",'').str.lower().str.split(',',expand=True).stack().value_counts()
10 loops, best of 3: 65.9 ms per loop

%%timeit     
df['actors_list'] = df['actors_list'].str.strip('[]').str.replace(', ',',').str.split(',')
10 loops, best of 3: 24.1 ms per loop

%%timeit
words = {}
for i in df['actors_list']:
    for w in i : 
        if w in words:
            words[w]+=1
        else:
            words[w]=1

100 loops, best of 3: 5.44 ms per loop
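The manual dict loop above is exactly what `collections.Counter` implements; a minimal sketch of the same counting step, assuming `actors_list` already holds real Python lists (the sample rows here are made up, not from the real dataset):

```python
from collections import Counter
from itertools import chain

# Hypothetical sample rows standing in for df['actors_list']
# after it has been split into real Python lists.
actors_list = [
    ['al pacino', 'robert de niro'],
    ['al pacino', 'james caan'],
]

# Counter over the flattened lists does the same job as the
# manual words-dict loop, in one line.
words = Counter(chain.from_iterable(actors_list))
print(words.most_common(2))  # most frequent names first
```

`Counter` is implemented in C in CPython, so it is usually at least as fast as the hand-written loop while being harder to get wrong.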
Bharath M Shetty
    Don't forget to time this part too: `df['actors_list'].str.strip('[]').str.replace(', ',',').str.split(',')` – cs95 Dec 12 '18 at 05:37
  • @Dark: I get a `Can only use .str accessor with string values, which use np.object_ dtype in pandas` error for your code. Also, where do you run your code? Jupyter notebook and IPython don't accept `%%` for me. – Reza energy Dec 12 '18 at 06:36
  • @Rezaenergy Remove the `%%timeit` part and its results, only use the code. And since the datatype is suggesting to be object you can directly run the code starting with `words={}`. – Bharath M Shetty Dec 12 '18 at 06:44
  • @Dark thank you, it works without `%timeit` or `%%timeit`, but I don't know why adding `%%timeit` causes this error: `Can only use .str accessor with string values, which use np.object_ dtype in pandas` – Reza energy Dec 12 '18 at 17:27
I would use `ast` to convert the list-like strings into actual lists:

import ast 
df.actors_list=df.actors_list.apply(ast.literal_eval)
pd.DataFrame(df.actors_list.tolist()).melt().value.value_counts()
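On pandas 0.25+, `Series.explode` can replace the `tolist()`/`melt()` step after the same `ast.literal_eval` parse; a minimal sketch with made-up sample rows mirroring the question's data:

```python
import ast

import pandas as pd

# Hypothetical frame mirroring the question's string-encoded lists.
df = pd.DataFrame({
    'actors_list': [
        "[u'Al Pacino', u'Robert De Niro']",
        "[u'Al Pacino', u'James Caan']",
    ]
})

# Parse each string into a real list, flatten with explode(),
# then count occurrences of each name.
counts = (
    df['actors_list']
    .apply(ast.literal_eval)
    .explode()
    .value_counts()
)
print(counts)
```

Unlike the `melt()` route, `explode()` never builds a wide intermediate frame padded with NaNs, so it stays memory-friendly when the lists have very different lengths.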
BENY
According to this code I got the chart below (benchmark image not reproduced here), where:

  • coldspeed's code is wen2()
  • Dark's code is wen4()
  • my code is wen1()
  • W-B's code is wen3()
Reza energy