0

So, this is the situation: I have a CSV file who looks like this:

show_id   title      cast
1         Batman     Robert Pattinson, Collin Farrel, Zoë Kravitz
2         Twilight   Robert Pattinson, Kristen Stewart
3         Ava        Jessica Chastain, Collin Farrel

What I need to do is open this CSV in a Python function, do some stuff to fix white spaces and that.

Then I need to upload that in a SQL database, (whatever I want, but I choose MySQL), no problem with that.

PROBLEM

My main issue is that then I need (because mi challenge said that) to create a Query to count how many times an actor appears in all the movies in the list. So, in this case the query should display something like this:

Actor               Amount_of_movies
Robert Pattinson    2
Collin Farrel       2
Zoë Kravitz         1
Kristen Stewart     1
Jessica Chastain    1

As you can see I don't have a name to search with LIKE or CONTAINS. So, how can I do that? Because, in the CSV, the list of actors per movie has, well, more than 1 actor, and I save them in a varchar or text type in the database, so each row has more than one actor.

Or should I create another table with actors and foreing keys to movies? Or is something than I can't do in MySQL but in other SQL is possible?

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
luca_leiva
  • 31
  • 4
  • hint .. `COUNT()`, `GROUP BY` and `HAVING` are your friends here. Something like `SELECT actor, COUNT(in_movie) as amount_of_movies FROM table GROUP BY actor HAVING COUNT(in_movie) > 0` – Zak Feb 25 '22 at 18:58
  • well yes, but my issue is that there is more than one actor in each column row, for example "Robert Pattinson, Collin Farrell", so wouldn´t your solution count BOTH Robert and Collin as one name (not sure if I explain myself)? Can I "split" the varchar/text in someway? – luca_leiva Feb 25 '22 at 19:08
  • 1
    you should urgently read https://stackoverflow.com/questions/3653462/is-storing-a-delimited-list-in-a-database-column-really-that-bad – nbk Feb 25 '22 at 19:12
  • yea I already imagine something like that trying to solve this, but well I'm quite junior and it is a challenge i got. Thanks for the info – luca_leiva Feb 25 '22 at 19:17

1 Answers1

0

If you look for something performance efficient you should rather split the data (create 3 tables in total - movies, actors, casts) and connect actors with movies using the casts, then write simple sql with some joins like:

Select actors.name as Actor, count(movies.title) as Amount_of_movies from actors,
inner join cast on cast.actor_id = actors.actor_id
inner join movies on movies.movie_id = cast.movie_id;

You are able to do it another way using https://sebhastian.com/mysql-split-string/ or using psql / plsql and stored procedure. I would just split the data if it's possible.

Cheers.

Gregu
  • 16
  • 1
  • 2
    that is called normalizing the data – nbk Feb 25 '22 at 19:13
  • 1
    Gregu the second solution worked! This was mi final Query: select * from (select SUBSTRING_INDEX(cast, ',', 1) as names , count(*) as counter from disney_plus_titles group by names order by counter desc) as subquery where subquery.names <> 'NaN' limit 1. The last part was because I needed also the most repeated actor without being the default 'NaN' – luca_leiva Feb 26 '22 at 01:15
  • Good job, but remember that data normalization is almost always better idea. – Gregu Feb 28 '22 at 09:16