I have a column in a DataFrame (production_company) that holds, for each movie, a list of strings naming its production companies. I want to find every unique production company across all movies.

Below is a sample of the values in production_company.

"['Universal Studios', 'Amblin Entertainment', 'Legendary Pictures', 'Fuji Television Network', 'Dentsu']"
"['Village Roadshow Pictures', 'Kennedy Miller Productions']"
"['Summit Entertainment', 'Mandeville Films', 'Red Wagon Entertainment', 'NeoReel']"
"['Lucasfilm', 'Truenorth Productions', 'Bad Robot']"
"['Universal Pictures', 'Original Film', 'Media Rights Capital', 'Dentsu', 'One Race Films']"
"['Regency Enterprises', 'Appian Way', 'CatchPlay', 'Anonymous Content', 'New Regency Pictures']"

I am first trying to flatten the column using the approach given in "Pandas Series of lists to one series".

But I get the error "TypeError: 'float' object is not iterable":

 17 slist =[]
 18 for company in production_companies:
---> 19     slist.extend(company )
 20 
 21 

TypeError: 'float' object is not iterable

production_companies holds the column df['production_company'].

Each company value is a list, so why is it being treated as a float? A list comprehension gives the same error: flattened_list = [y for x in production_companies for y in x]

jpp
  • Hi Sujit, the list comprehension should be `flattened_list = [y for y in x for x in production companies]` although this would just give you individual characters of each string, so I do not know what you would accomplish with this. – d_kennetz Aug 22 '18 at 14:02

1 Answer

You can use collections.Counter to count items. I would split the task into 3 steps:

  1. Convert series of strings into a series of lists via ast.literal_eval.
  2. Use itertools.chain to form an iterable of companies and feed to Counter.
  3. Use a set comprehension to filter for companies with a count of 1.

Here's a demo:

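from ast import literal_eval
from itertools import chain
from collections import Counter

# Parse each string such as "['A', 'B']" into a real Python list
s = df['companies'].map(literal_eval)
# Count how often each company appears across all rows
c = Counter(chain.from_iterable(s))
# Keep only the companies that appear exactly once
c_filtered = {k for k, v in c.items() if v == 1}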

Result:

print(c_filtered)

{'Village Roadshow Pictures', 'Kennedy Miller Productions', 
 ...
 'Truenorth Productions', 'Regency Enterprises'}
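
A likely cause of the "'float' object is not iterable" error in the question is a missing value: pandas stores missing entries as NaN, which is a float. Likewise, "ValueError: malformed node or string" is what literal_eval raises when it is given an object that is already a list rather than a string. The following is a minimal sketch, not part of the steps above, that tolerates all three cases. The helper name to_company_list and the sample rows are hypothetical, and a plain Python list stands in for the column; with a real DataFrame the helper could be passed to Series.map instead.

```python
from ast import literal_eval
from collections import Counter
from itertools import chain


def to_company_list(value):
    """Return the company names held in one cell (hypothetical helper)."""
    if isinstance(value, list):
        return value                # already a list, nothing to parse
    if isinstance(value, str):
        return literal_eval(value)  # string such as "['A', 'B']"
    return []                       # NaN or None: treat as no companies


# Illustrative stand-in for the column: a string, a list, and a missing value
rows = [
    "['Universal Studios', 'Dentsu']",
    ['Lucasfilm', 'Bad Robot'],
    float('nan'),
    "['Universal Pictures', 'Dentsu']",
]

counts = Counter(chain.from_iterable(to_company_list(v) for v in rows))
all_companies = sorted(counts)                               # every distinct company
seen_once = sorted(k for k, v in counts.items() if v == 1)   # count of exactly 1

print(all_companies)
print(seen_once)
```

If the goal is every distinct company rather than only those that occur once, the keys of the Counter (all_companies here) are enough and the filtering step can be skipped.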
jpp
  • Thank you jpp. I tried, but I get an error at s = companies.map(literal_eval): ValueError: malformed node or string: ['Ingenious Film Partners', 'Twentieth Century Fox Film Corporation', 'Dune Entertainment', 'Lightstorm Entertainment'] ... where companies = df['companies'] – Sujit Sarkar Aug 22 '18 at 16:19
  • Sorry, can't reproduce. I took the data you provided for my solution. – jpp Aug 22 '18 at 16:22
  • The initial column has |-separated values. df2['production_companies'].head(20) gives: Ingenious Film Partners|Twentieth Century Fox ... Lucasfilm|Twentieth Century Fox Film Corporation Paramount Pictures|Twentieth Century Fox Film .. I then did a str.split, and df2['production_companies'].str.split('|').head(20) gives: [Ingenious Film Partners, Twentieth Century Fo... [Lucasfilm, Twentieth Century Fox Film Corpora... [Paramount Pictures, Twentieth Century Fox Fil... [Warner Bros., Hoya Productions] My requirement is to parse this. – Sujit Sarkar Aug 23 '18 at 12:42
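
For a |-separated column like the one described in the comment above, literal_eval should not be needed, because str.split('|') already yields real lists. Here is a rough sketch, assuming a pandas version that provides Series.explode (0.25 or later). The sample values are made up for illustration from company names in the comments and are not the asker's actual data.

```python
import pandas as pd

# Illustrative data only: pipe-separated names plus one missing value
df2 = pd.DataFrame({
    'production_companies': [
        'Lucasfilm|Twentieth Century Fox Film Corporation',
        None,
        'Warner Bros.|Hoya Productions',
        'Twentieth Century Fox Film Corporation',
    ]
})

companies = (
    df2['production_companies']
    .dropna()          # drop missing rows, which would otherwise be NaN floats
    .str.split('|')    # each remaining row becomes a list of names
    .explode()         # one row per company
)

unique_companies = sorted(companies.unique())  # every distinct company
counts = companies.value_counts()              # occurrences per company

print(unique_companies)
print(counts['Twentieth Century Fox Film Corporation'])
```

Dropping the missing rows before splitting is what avoids the float error here; companies appearing exactly once could then be selected with counts[counts == 1].index.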