I am working on a research project, using the IMDb dataset as my source of secondary data. I downloaded the entire database in text format from .ftp servers provided by IMDb itself, and used the IMDbPY python package to compile all of the unsorted information into a relational database. I chose to use SQLite as my SQL engine, as it seemed like the least cumbersome option thanks to its ability to create locally-stored databases. After a bit of poking around and a lot of documentation-reading, I ended up with a 9.04 GB im.db file, hosting the entirety of IMDb.
Now I need to isolate my dataset according to my requirements, but due to my lack of experience with SQL I'm finding it difficult to figure out the most optimal way of doing so.
Specifically, I want to look at:
- Movies only (i.e. exclude TV series, episodes, etc.);
- Produced in the period between 2000-2015, inclusive;
- Of feature length (i.e. running time over 40 minutes);
- Non-adult (I didn't even know IMDb hosts information on these, but apparently so);
- Produced in the USA;
- With complete information on crew.
Here's a representation of my database schema. I was confused by some of the database design choices that IMDbPY creators made, but I'm no SQL expert, and this is what I get to work with. Some clarifications:
- The title table holds basic information about every instance of films, shows, episodes, and so on, 3,673,485 rows in total. The id column is an auto-incremented primary key, which is referenced as the movie_id foreign key in all other relevant tables. However, it seems like that none of the foreign keys in other tables are indexed properly, so I can't use simple query statements to properly get necessary information just by knowing a particular film's id value.
- Running
SELECT count(*) FROM title WHERE kind_id=1 AND production_year BETWEEN 2000 AND 2015;
tells me that there are 442,135 instances of movies, produced between 2000-2015. So far so good. - The complete_cast and comp_cast_type tables hold info about the completion status of a film's crew/cast list. Since I only need to consider films with complete crew information, I need to isolate only those instances, where (i) movie_id exists in my previous query (i.e. out of the 442,135 movie rows); (ii) subject_id=2; and (iii) status_id=3 or 4.
This is where it gets tricky for me. The movie_info table holds 20 million rows of information about films and TV shows, including runtimes, genres, countries of production, years of production, etc. Basically all of the information that I need to isolate my dataset. Within that table (i) id is an arbitrary auto-incremented primary key; (ii) movie_id refers to the id values from title; (iii) info_type_id refers to one of the 113 types of information as listed in the table info_type; (iv) info holds the actual information, as integers or strings.
For example: Running
SELECT id FROM title WHERE title='2001: A Space Odyssey' AND kind_id=1;
returns '2484213'. RunningSELECT info FROM movie_info WHERE movie_id=2484213 AND info_type_id=1;
returns '142, 161, 149', indicating the running times in minutes of the three available versions of the film. RunningSELECT info FROM movie_info WHERE movie_id=2484213 AND info_type_id=8
returns 'USA, UK', indicating the countries involved in production. And so on.
Basically I'm trying to create a new table, populated only with films that fall under my requirements, and I'm having a hard time figuring out the most efficient way of doing so. Here's how I translated my requirements into basic SQL syntax:
SELECT * FROM title WHERE kind_id=1 AND production_year BETWEEN 2000 AND 2015;
- Then a bunch of requirements from the movie_info table, which cross-references only those instances, where movie_id exists as id in the query above, and (i)
info_type_id=1 AND info>40;
(ii)info_type_id=3 AND info!='Adult';
(iii)info_type_id=8 AND info='USA';
- Finally, I need to make sure that all of the selections exist in the complete_cast table, and
WHERE subject_id=2 AND status_id=3 OR 4;
I've been reading SQLite documentation, and suspect that I need to use some combination of INNER/LEFT OUTER JOIN, EXISTS and UNION/INTERSECT/EXCEPT statements, but not sure how to approach this exactly. I would like to write this code efficiently, since brute-forcing queries requirement by requirement takes a while for my computer to process. Thank you in advance for your help.
TL;DR. I can't figure out an efficient way of using INNER/LEFT OUTER JOIN, EXISTS and UNION/INTERSECT/EXCEPT statements to help me isolate a smaller dataset in accordance with multiple requirements, to satisfy which I need to cross-query a number of existing tables without properly indexed foreign keys.