R - IMDb dataset files - how to merge lines per film

Question

One of the IMDb dataset files (title.principals) contains details about cast and crew. I would like to extract the directors' details and merge them into a single line, as there can be several directors per film. Is this possible?

#title.principals file download
url <- "https://datasets.imdbws.com/title.principals.tsv.gz"
tmp <- tempfile()
download.file(url, tmp)

#file load
title_principals <- readr::read_tsv(
  file = gzfile(tmp),
  col_names = TRUE, 
  quote = "",
  na = "\\N",
  progress = FALSE
)

#name.basics file download
url <- "https://datasets.imdbws.com/name.basics.tsv.gz"
tmp <- tempfile()
download.file(url, tmp)

#file load
name_basics <- readr::read_tsv(
  file = gzfile(tmp),
  col_names = TRUE, 
  quote = "",
  na = "\\N",
  progress = FALSE
)

#extract directors data
library(dplyr)
library(stringr)

df_directors <- title_principals %>%
  filter(str_detect(category, "director")) %>%
  select(tconst, ordering, nconst, category) %>%
  group_by(tconst)

#add director names; nconst is the key shared by both files
df_directors <- df_directors %>% left_join(name_basics, by = "nconst")

head(df_directors, 20)

I'm joining it with the name_basics file to get each director's name. name_basics contains the name, birth and death year, and profession. After this step, I would like to merge all directors per film into a single cell, separated by commas for example.

Is it somehow possible?

What are the contents of `name_basics`? Do you need https://stackoverflow.com/questions/15933958/collapse-concatenate-aggregate-a-column-to-a-single-comma-separated-string-w ? — Ronak Shah, Sep 06 '19 at 08:51
To merge all Directors per film where is the film name in `df_directors` ? — Ronak Shah, Sep 06 '19 at 10:33
There is no film name. There is ID tconst. Title will be added after merge with directors from another file. — Supek, Sep 06 '19 at 10:38

score 0 · Answer 1 · answered Sep 06 '19 at 17:21

Please see this guide for a minimal reproducible example. Setting up a simplified example with fake data that highlights the exact problem will help other people help you faster.

As I understand it, you want to take a file that has multiple rows per value of ID_tconst, each with a different value of Director_Name, and collapse it to a file with one row per value of ID_tconst and a comma-separated list of Director_Names.

Here is a simple mock data set and solution. Note the use of the collapse argument in paste instead of sep.

library(tidyverse)
example <- tribble(
  ~ID_tconst, ~Director_Name, 
  1, "Aaron",
  2, "Bob",
  2, "Cathy",
  3, "Doug",
  3, "Edna",
  3, "Felicity"
)

collapsed <- example %>% 
  group_by(ID_tconst) %>% 
  summarize(directors = paste(Director_Name, collapse = ","))
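
The same pattern should carry over to the joined data from the question. Below is a rough sketch, not tested against the real files: the rows are invented, and it assumes the name column that name_basics contributes is called primaryName (check with names(df_directors) first). Sorting by ordering before collapsing is an optional extra so that names appear in the order listed in title.principals.

```r
library(dplyr)

# Mock rows shaped like df_directors after the join with name_basics.
# IDs and names are made up; primaryName is an assumed column name.
df_directors <- data.frame(
  tconst      = c("tt0000001", "tt0000002", "tt0000002", "tt0000003"),
  ordering    = c(1, 2, 1, 1),
  nconst      = c("nm0000001", "nm0000003", "nm0000002", "nm0000004"),
  primaryName = c("Aaron", "Cathy", "Bob", NA),
  stringsAsFactors = FALSE
)

directors_per_film <- df_directors %>%
  filter(!is.na(primaryName)) %>%   # drop rows where the join found no name
  arrange(tconst, ordering) %>%     # keep the listed order within each film
  group_by(tconst) %>%
  summarize(directors = paste(primaryName, collapse = ", "))

print(directors_per_film)
```

The result has one row per tconst, so it can be joined to the titles file later by that ID. Without the filter step, a missing name would show up as the literal text "NA" in the collapsed string, because paste converts NA to a string.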
