12

This question is about a generic mechanism for converting any collection of non-cyclical homogeneous or heterogeneous data structures into a dataframe. This can be particularly useful when dealing with the ingestion of many JSON documents or with a large JSON document that is an array of dictionaries.

There are several SO questions that deal with manipulating deeply nested JSON structures and turning them into dataframes using functionality such as plyr, lapply, etc. All the questions and answers I have found are about specific cases as opposed to offering a general approach for dealing with collections of complex JSON data structures.

In Python and Ruby I've been well-served by implementing a generic data structure flattening utility that uses the path to a leaf node in a data structure as the name of the value at that node in the flattened data structure. For example, the value my_data[['x']][[2]][['y']] would appear as result[['x.2.y']].

If one has a collection of these data structures that may not be entirely homogeneous the key to doing a successful flattening to a dataframe would be to discover the names of all possible dataframe columns, e.g., by taking the union of all keys/names of the values in the individually flattened data structures.

This seems like a common pattern and so I'm wondering whether someone has already built this for R. If not, I'll build it but, given R's unique promise-based data structures, I'd appreciate advice on an implementation approach that minimizes heap thrashing.

Sim
  • 13,147
  • 9
  • 66
  • 95
  • Huh? Too much English for me (anyway) to understand. Suggest providing some reproducible input with some (presumably) slow code that produces the output you'd like, and go from there. Maybe it's just that I don't know JSON. Can you provide something pastable into a fresh R session that downloads some JSON data from somewhere to demonstrate your question? [How to make a great reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Matt Dowle Jul 19 '12 at 08:55

5 Answers5

9

Hi @Sim I had cause to reflect on your problem yesterday define:

flatten<-function(x) {
    dumnames<-unlist(getnames(x,T))
    dumnames<-gsub("(*.)\\.1","\\1",dumnames)
    repeat {
        x <- do.call(.Primitive("c"), x)
        if(!any(vapply(x, is.list, logical(1)))){
           names(x)<-dumnames
           return(x)
        }
    }
}
getnames<-function(x,recursive){

    nametree <- function(x, parent_name, depth) {
        if (length(x) == 0) 
            return(character(0))
        x_names <- names(x)
        if (is.null(x_names)){ 
            x_names <- seq_along(x)
            x_names <- paste(parent_name, x_names, sep = "")
        }else{ 
            x_names[x_names==""] <- seq_along(x)[x_names==""]
            x_names <- paste(parent_name, x_names, sep = "")
        }
        if (!is.list(x) || (!recursive && depth >= 1L)) 
            return(x_names)
        x_names <- paste(x_names, ".", sep = "")
        lapply(seq_len(length(x)), function(i) nametree(x[[i]], 
            x_names[i], depth + 1L))
    }
    nametree(x, "", 0L)
}

(getnames is adapted from AnnotationDbi:::make.name.tree)

(flatten is adapted from discussion here How to flatten a list to a list without coercion?)

as a simple example

my_data<-list(x=list(1,list(1,2,y='e'),3))

> my_data[['x']][[2]][['y']]
[1] "e"

> out<-flatten(my_data)
> out
$x.1
[1] 1

$x.2.1
[1] 1

$x.2.2
[1] 2

$x.2.y
[1] "e"

$x.3
[1] 3

> out[['x.2.y']]
[1] "e"

so the result is a flattened list with roughly the naming structure you suggest. Coercion is avoided also which is a plus.

A more complicated example

library(RJSONIO)
library(RCurl)
json.data<-getURL("http://www.reddit.com/r/leagueoflegends/.json")
dumdata<-fromJSON(json.data)
out<-flatten(dumdata)

UPDATE

naive way to remove trailing .1

my_data<-list(x=list(1,list(1,2,y='e'),3))
gsub("(*.)\\.1","\\1",unlist(getnames(my_data,T)))

> gsub("(*.)\\.1","\\1",unlist(getnames(my_data,T)))
[1] "x.1"   "x.2.1" "x.2.2" "x.2.y" "x.3"  
Community
  • 1
  • 1
shhhhimhuntingrabbits
  • 7,397
  • 2
  • 23
  • 23
  • Looks promising. How would you suggest we get rid of the trailing `.1`s? – Sim Aug 12 '12 at 22:46
  • You should be able to reassign `names(flattened_structure)`, right? – Sim Aug 12 '12 at 22:57
  • I agree. Cleaner now. My question was specifically about converting a large JSON document that is an array of dictionaries/hashes to data.frame. For that you'd have to build the column set as the union of all flattened list names, right? – Sim Aug 13 '12 at 00:52
  • Thanks for the example code. I however get- 'names' attribute [71556] must be the same length as the vector [66648] - on a large nested set. – EarlyPoster Aug 14 '13 at 20:05
  • Great answer - I just want to suggest one change in flatten. Currently, you do your flattening (.Primitive("c)) before checking that the list is already flattened. If you provide list that is already flattened, it seems to convert an atomic vector, which loses all type information. I suggest moving the flattening to the end of the repeat loop. – DavidR May 22 '17 at 16:47
4

R has two packages for dealing with JSON input: rjson and RJSONIO. If I understand correctly what you mean by "collection of non-cyclical homogeneous or heterogeneous data structures", I think either of these packages will import that sort of structure as a list.

You can then flatten that list (into a vector) using the unlist function.

If the list is suitably structured (a non-nested list where each element is the same length) then as.data.frame prvoides an alternative to convert the list to be a data frame.

An example:

(my_data <- list(x = list('1' = 1, '2' = list(y = 2))))
unlist(my_data)
Richie Cotton
  • 118,240
  • 47
  • 247
  • 360
  • What's with the downvote? `unlist` seems exactly like the "generic data structure flattening utility" that @Sim wants. In fact, the similar question linked to by @ttmaccer includes answers that make extensive use of `unlist`. – Richie Cotton Jul 19 '12 at 11:15
  • @ttmaccer: Yeah, you can't have it both ways in R. You either a flat (vector) data structure with a single data type or a nested (list) structure with mixed types. I think that there are enough tools in R that any JSON is transformable into whatever you want though. – Richie Cotton Jul 19 '12 at 16:39
  • 1
    @RichieCotton @ttmaccer I agree that `unlist` won't work in a generic way. If this is the best R has out of the box, I'll go ahead and write the recursive descent flattener I've used in other languages. – Sim Jul 28 '12 at 00:03
4

The jsonlite package is a fork of RJSONIO specifically designed to make conversion between JSON and data frames easier. You don't provide any example json data, but I think this might be what you are looking for. Have a look at this blog post or the vignette.

Jeroen Ooms
  • 31,998
  • 35
  • 134
  • 207
  • `fromJSON(path_to_file, flatten=F)` -- then examine the output. It's usually quite logical, although you might need a few `rbindlist()` and other tidying to fully unwrap it. – Ufos May 28 '19 at 11:29
1

Great answer with the flatten and getnames functions. Took a few minutes to figure out all the options needed to get from a vector of JSON strings to a data.frame, so I thought I'd record that here. Suppose jsonvec is a vector of JSON strings. The following builds a data.frame (data.table) where there is one row per string, and each column corresponds to a different possible leaf node of the JSON tree. Any string missing a particular leaf node is filled with NA.

library(data.table)
library(jsonlite)
parsed = lapply(jsonvec, fromJSON, simplifyVector=FALSE)
flattened = lapply(parsed, flatten) #using flatten from accepted answer
d = rbindlist(flattened, fill=TRUE)
DavidR
  • 810
  • 2
  • 8
  • 16
1

I'm now a big fan of simply:

library(jsonlite)
library(tidyverse)

fromJSON("file_path.json") %>% 
unlist() %>% 
enframe()

And then potentially, depending on your data, piping that into

%>%
pivot_wider()

Once it's in a flat table shape, there are a load of tools in tidyverse and other R libraries more generally for wrangling things around and e.g., dealing with columns with similar prefixes (which will result from the above pipeline as the parent name of the children within a nested json chunk will be prefixed to the child's name).

gladys_c_hugh
  • 158
  • 1
  • 9