
I am playing with the Kaggle Star Trek Scripts dataset, but I am struggling with converting the JSON to a data frame in R. Ideally I would convert it into a long-form dataset with index columns for series, episode, and character, with their lines on individual rows. I did find this answer, however it is not in R.

Currently the JSON looks like the photo below. Sorry it is not a full example, but I put a small mocked version below as well. If you want, you can download the data yourself from here.

Current JSON view (screenshot)

Mock Example

{
  "ENT": {
    "episode_0": {
      "KLAANG": {
        "\"Pung ghap! Pung ghap!\"": {},
        "\"DujDaj Hegh!\"": {}
      }
    },
    "episode_1": {
      "ARCHER": {
        "\"Warp me!\"": {},
        "\"To boldly go!\"": {}
      }
    }
  }
}
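For concreteness, here is roughly the nested list that I believe rjson::fromJSON produces for the mock above (this is a hand-built copy, so treat it as my understanding of the structure rather than gospel): each spoken line is a *name* whose value is an empty list, which is part of what makes this awkward.

```r
# Hand-built mirror of the mock JSON: every line of dialogue is the NAME
# of an empty list, not a value.
mock <- list(
  ENT = list(
    episode_0 = list(
      KLAANG = list(
        "\"Pung ghap! Pung ghap!\"" = list(),
        "\"DujDaj Hegh!\"" = list()
      )
    ),
    episode_1 = list(
      ARCHER = list(
        "\"Warp me!\"" = list(),
        "\"To boldly go!\"" = list()
      )
    )
  )
)

# Peek at the nesting: series -> episode -> character -> line-names.
str(mock, max.level = 3)
```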

The issue I have is that the second level, episodes, is individually numbered, so my regular bag of tricks for flattening by attribute name is not working. I am unsure how to loop through a level rather than over an attribute name.

What I would ideally want is a long-form dataset that looks like this:

Series  Episode     Character   Lines
ENT     episode_0   KLAANG      Pung ghap! Pung ghap!
ENT     episode_0   KLAANG      DujDaj Hegh!
ENT     episode_1   ARCHER      Warp me!
ENT     episode_1   ARCHER      To boldly go!

My current code looks like the below, which is what I would normally start with, but it is obviously not working, or at least not getting far enough.

your_df <- result[["ENT"]] %>%
  purrr::flatten() %>%
  map_if(is_list, as_tibble) %>%
  map_if(is_tibble, list) %>%
  bind_cols()

I have also tried using stack() and map_dfr(), but with no success. So I yet again come humbly to you, dear reader, for expertise. JSON is the bane of my existence. I struggle with applying other answers to my circumstances, so any advice or examples I can reverse engineer and learn from are most appreciated.

Also happy to clarify or expand on anything if possible.

-Jake

1 Answer


So I was able to brute force it thanks to Michael's answer on the thread How to flatten a list of lists?, so shout out to them.

The function below allowed me to convert the JSON into a flat list of lists.

flattenlist <- function(x){
  morelists <- sapply(x, function(xprime) class(xprime)[1] == "list")
  out <- c(x[!morelists], unlist(x[morelists], recursive = FALSE))
  if (sum(morelists)) {
    Recall(out)
  } else {
    return(out)
  }
}
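A quick demonstration of what flattenlist actually does: each round of unlist(recursive = FALSE) peels off one level of nesting and pastes the names together with dots, so the series/episode/character path ends up encoded in a single name. (I am using character leaves here instead of the empty-object leaves in the real data, purely to keep the toy example small.)

```r
flattenlist <- function(x){
  morelists <- sapply(x, function(xprime) class(xprime)[1] == "list")
  out <- c(x[!morelists], unlist(x[morelists], recursive = FALSE))
  if (sum(morelists)) {
    Recall(out)
  } else {
    return(out)
  }
}

# Toy nested list; leaves are character vectors here just for illustration.
nested <- list(ENT = list(episode_0 = list(KLAANG = c("line 1", "line 2"))))

flat <- flattenlist(nested)
names(flat)   # "ENT.episode_0.KLAANG" -- series/episode/character in one name
```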

So, putting it all together, I ended up with the following solution. Annotations for your entertainment.

# rjson supplies the fromJSON(file = ...) call used below; loading jsonlite
# as well just masks one fromJSON with the other, and dplyr/tidyr already
# come with the tidyverse.
library(tidyverse)
library(rjson)

result <- fromJSON(file = "C:/Users/jacob/Downloads/all_series_lines.json")

# Mike's function to get to a list of lists
flattenlist <- function(x){
  morelists <- sapply(x, function(xprime) class(xprime)[1] == "list")
  out <- c(x[!morelists], unlist(x[morelists], recursive = FALSE))
  if (sum(morelists)) {
    Recall(out)
  } else {
    return(out)
  }
}

# Mike's function applied
final <- as.data.frame(do.call("rbind", flattenlist(result)))

# Turn all the lists into a master data frame and ensure the index becomes a column I can separate later for context.
final <- cbind(Index_Name = rownames(final), final)
rownames(final) <- 1:nrow(final)

# The flattening makes the last elements of the JSON the variables (columns)
# of the data frame, so I need to force it back into a long-form dataset.
final2 <- gather(final, "key", "value", -Index_Name)

# I separate each element of Index_Name into my three mapping variables:
# Series, Episode and Character. I can also keep the original column names
# from above as a script line id.
final2$Episode <- gsub(".*\\.(.*)\\..*", "\\1", final2$Index_Name)
final2$Series <- substr(final2$Index_Name, start = 1, stop = 3)
final2$Character <- sub(".*\\.", "", final2$Index_Name)