One of the IMDb dataset files (title.principals) contains details about the cast and crew of each title. I would like to extract the directors and merge them into a single line per film, since a film can have several directors. Is this possible?
# Download the title.principals file
url <- "https://datasets.imdbws.com/title.principals.tsv.gz"
tmp <- tempfile()
download.file(url, tmp)
# Load the file
title_principals <- readr::read_tsv(
  file = gzfile(tmp),
  col_names = TRUE,
  quote = "",
  na = "\\N",
  progress = FALSE
)
# Download the name.basics file
url <- "https://datasets.imdbws.com/name.basics.tsv.gz"
tmp <- tempfile()
download.file(url, tmp)
# Load the file
name_basics <- readr::read_tsv(
  file = gzfile(tmp),
  col_names = TRUE,
  quote = "",
  na = "\\N",
  progress = FALSE
)
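# Extract the directors (dplyr provides %>%, filter, select; stringr provides str_detect)
library(dplyr)
library(stringr)
df_directors <- title_principals %>%
  filter(str_detect(category, "director")) %>%
  select(tconst, ordering, nconst, category) %>%
  group_by(tconst)
# Add the director names, joining explicitly on the person ID
df_directors <- df_directors %>% left_join(name_basics, by = "nconst")
head(df_directors, 20)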
I join the result with name_basics to get each director's name; name_basics contains the name, birth and death years, and professions. After this step, I would like to merge all directors of a film into a single cell, separated by commas for example.
Is this somehow possible?
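To illustrate the result I am after, here is a minimal sketch on a tiny made-up table standing in for the joined data. The IDs and names below are invented, not taken from the IMDb files, and `toy_directors` and `directors_per_film` are hypothetical names. It assumes the name column is `primaryName` and that grouping by `tconst` and then collapsing the names with `paste(collapse = ", ")` inside `summarise()` is an acceptable approach; it may not be the most efficient one on the full dataset.

```r
library(dplyr)

# Made-up stand-in for df_directors after the join (values are invented)
toy_directors <- data.frame(
  tconst      = c("tt0000001", "tt0000002", "tt0000002"),
  ordering    = c(1, 1, 2),
  primaryName = c("Director A", "Director B", "Director C"),
  stringsAsFactors = FALSE
)

# One row per film, with all director names collapsed into a single string
directors_per_film <- toy_directors %>%
  arrange(tconst, ordering) %>%   # keep the credited order within each film
  group_by(tconst) %>%
  summarise(directors = paste(primaryName, collapse = ", "), .groups = "drop")

print(directors_per_film)
```

On the real data, the same `group_by()` and `summarise()` steps would presumably be applied to `df_directors` in place of the toy table.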