I have a data.table where one of the columns contains JSON. I am trying to extract the content so that each variable is a column.

library(jsonlite)
library(data.table)

df<-data.table(a=c('{"tag_id":"34","response_id":2}',
                   '{"tag_id":"4","response_id":1,"other":4}',
                   '{"tag_id":"34"}'),stringsAsFactors=F)

The desired result, which does not include the "other" variable:

   tag_id response_id
1    "34"           2
2     "4"           1
3    "34"          NA

I have tried several versions of:

parseLog <- function(x){
  if (is.na(x))
    e=c(tag_id=NA,response_id=NA)
  else{
    j=fromJSON(x)
    e=c(tag_id=as.integer(j$tag_id),response_id=j$response_id)
  }
  e
}

That seems to work well for retrieving a list of vectors (or of lists, if c is replaced by list), but when I try to convert the list into a data.table something doesn't work as expected.

   parsed<-lapply(df$a,parseLog)
   rparsed<-do.call(rbind.data.frame,parsed)
   colnames(rparsed)<-c("tag_id","response_id")

It fails because of the missing value in the third row. How can I solve this in a clean, R-ish way? How can I make my parse method return an NA for the missing value? Alternatively, is there a "fill" parameter, like the one data.table's rbind has, that can be used with rbind.data.frame or an analogous method?

The dataset I am using has 11M rows so performance is important.

Additionally, is there an equivalent of rbind.data.frame that produces a data.table, and how would it be used? The documentation refers me to rbindlist, but it complains that the fill parameter is unused, and if I call it directly (without do.call) it complains about the type of parsed:

rparsed<-do.call(rbindlist,fill=T,parsed)
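
(Editor's note: a minimal sketch of the rbindlist() call being attempted here. rbindlist() takes the list as its first argument, so do.call() is not needed; it does, however, expect a list of lists or data.frames, not the list of atomic vectors that parseLog() returns. Assuming each record is kept as a plain named list:)

```r
library(jsonlite)
library(data.table)

df <- data.table(a = c('{"tag_id":"34","response_id":2}',
                       '{"tag_id":"4","response_id":1,"other":4}',
                       '{"tag_id":"34"}'))

# Keep each parsed record as a named list, then bind with fill = TRUE
# so records missing a field get NA in that column.
parsed  <- lapply(df$a, function(x) as.list(fromJSON(x)))
rparsed <- rbindlist(parsed, fill = TRUE)[, .(tag_id, response_id)]
```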

EDIT: The case I need to cover is more general, in a set of 11M records all the possible circumstances happen:

df<-data.table(a=c('{"tag_id":"34","response_id":2}',
                   '{"trash":"34","useless":2}',                          
                   '{"tag_id":"4","response_id":1,"other":4}',
                   NA,
                   '{"response_id":"34"}', 
                   '{"tag_id":"34"}'),stringsAsFactors=F)

and the output should only contain tag_id and response_id columns.
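
(Editor's note: one way to cover all of these cases at once is to normalise every record to exactly the wanted fields before binding; this also avoids materialising the unwanted columns. A sketch, where keep and parse_one are illustrative names, not part of any package:)

```r
library(jsonlite)
library(data.table)

df <- data.table(a = c('{"tag_id":"34","response_id":2}',
                       '{"trash":"34","useless":2}',
                       '{"tag_id":"4","response_id":1,"other":4}',
                       NA,
                       '{"response_id":"34"}',
                       '{"tag_id":"34"}'))

keep <- c("tag_id", "response_id")

# Normalise each record to exactly the wanted fields: NA input and
# records missing a field both yield NA, and unwanted fields ("trash",
# "useless", "other") are never materialised.
parse_one <- function(x) {
  row <- setNames(as.list(rep(NA, length(keep))), keep)
  if (!is.na(x)) {
    j <- fromJSON(x)
    found <- intersect(names(j), keep)
    row[found] <- j[found]
  }
  row
}

out <- rbindlist(lapply(df$a, parse_one))
```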

Picarus

1 Answer

There might be a simpler way but this seems to be working:

library(data.table)
library(jsonlite)
df[, json := sapply(a, fromJSON)][, rbindlist(lapply(json, data.frame), fill=TRUE)]

#or if you need all the columns :
#df[, json := sapply(a, fromJSON)][,
#   c('tag_id', 'response_id') := rbindlist(lapply(json, data.frame), fill=TRUE)]

Output:

> df[, json := sapply(a, fromJSON)][, rbindlist(lapply(json, data.frame), fill=TRUE)]
   tag_id response_id
1:     34           2
2:      4           1
3:     34          NA

EDIT:

This solution comes after the edit of the question with additional requests.

There are lots of ways to do this, but I find the simplest is to drop the extra column when each data.frame is created, like this:

df[, json := sapply(a, fromJSON)][, 
   rbindlist(lapply(json, function(x) data.frame(x)[-3]), fill=TRUE)]

#   tag_id response_id
#1:     34           2
#2:      4           1
#3:     34          NA
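
(A variant that keeps columns by name instead of dropping them by position, so the call still works when the unwanted fields differ from record to record; keep is a name chosen here for illustration:)

```r
library(jsonlite)
library(data.table)

df <- data.table(a = c('{"tag_id":"34","response_id":2}',
                       '{"tag_id":"4","response_id":1,"other":4}',
                       '{"tag_id":"34"}'))

keep <- c("tag_id", "response_id")

# Select the wanted fields by name before building each data.frame,
# so extra fields are dropped wherever they appear.
res <- df[, rbindlist(lapply(a, function(x) {
  j <- fromJSON(x)
  as.data.frame(j[names(j) %in% keep], stringsAsFactors = FALSE)
}), fill = TRUE)]
```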
LyzandeR
  • I have edited my example slightly, as I see that in the aim of summarizing I had not included some relevant data. I have additional variables in the json that I am not interested in... many of them. Excellent answer though! – Picarus Dec 08 '15 at 22:30
  • Still the solution does not consider that the number of variables to be ignored is high, meaning that the data frame that may be built can be huge. Is there a way to avoid that? Maybe adding the columns that are relevant one by one? Is the order of the variables guaranteed in the json column? Even having the whole "json" column that you use as an intermediate step will require a lot of memory. I know I am being picky, just trying to cover all possible issues. – Picarus Dec 09 '15 at 00:49
  • When I post answers on stackoverflow I do not know of the R-skills or level of each OP. Therefore, I try to be as explicit as possible in order to accommodate most OPs. I simply provide the answer that would solve the particular task in the question. The above answer is written in two steps in order to make sense. I find it very easy to combine the intermediate step with the next step in order to reduce the memory. Basically, just replace `json` in the second step to `sapply(a, fromJSON)`. Also, I used `data.frame(x)[-3])` above but if you have 1000 columns then just replace `-3` with the – LyzandeR Dec 09 '15 at 15:10
  • character vector of the column names you want to keep. I assume this is something easy to do. – LyzandeR Dec 09 '15 at 15:11
  • First, thanks for your help. I take the blame for the misunderstanding, as the two updates I have made to the question have caused confusion. Second, the idea of selecting the required columns instead of eliminating the unnecessary ones doesn't work, as in some cases the second column will not exist. The case with NA in the "a" column doesn't make things easy either: if I try to replace NA with {}, subsequent steps ignore those rows. And yes, the simplification you refer to is easy; I just wanted to make clear that it is important to be memory efficient if possible. – Picarus Dec 10 '15 at 02:54