1

I have a data frame which consists of a single column of some very messy JSON data. I would like to convert the JSON entries in that column to additional columns in the same data frame, I have a messy solution, but it will be tedious and long to apply it to my actual dataset.

Here is my sample data frame:

sample.df <- data.frame(id = c(101, 102, 103, 104),
                    json_col = c('[{"foo_a":"bar"}]',
                                 '[{"foo_a":"bar","foo_b":"bar"}]',
                                 '[{"foo_a":"bar","foo_c":2}]',
                                 '[{"foo_a":"bar","foo_b":"bar","foo_c":2,"nested_col":{"foo_d":"bar","foo_e":3}}]'),
                    startdate = as.Date(c('2010-11-1','2008-3-25','2007-3-14','2006-2-21')))

in reality my data frame has over 100000 entries and consists of multiple JSON columns where I need to apply the solution to this question, there are also several orders of nested lists (i.e. nested lists within nested lists).

Here is my solution:

j.col <- sample.df[2]
library(jsonlite)
j.l <- apply(j.col, 1, jsonlite::fromJSON, flatten = T)
library(dplyr)
l.as.df <- bind_rows(lapply(j.l,data.frame))
new.df <- cbind(sample.df$id, l.as.df, sample.df$startdate)

My solution is a roundabout method were I extract the column from the data frame with the JSON stuff, and then convert the JSON into a second dataframe, and then I combine the two dataframes into a third datadframe. This will be long and tedious to do with my actual data, not to mention that it is inelegant. How can I do this without having to create the additional data frames?

Thanks in advance for any help!

pd441
  • 2,644
  • 9
  • 30
  • 41
  • Given the messiness of that data I would actually say your approach is pretty good! I'm sure one of the wizards on this site will provide you with something more elegant, though :). – 93i7hdjb Apr 03 '18 at 15:51

3 Answers3

2

Here's another approach that will spare you the intermediate dataframes:

library(dplyr)
library(jsonlite)

new.df <- sample.df %>% 
          rowwise() %>%
          do(data.frame(fromJSON(.$json_col, flatten = T))) %>%
          ungroup() %>%
          bind_cols(sample.df %>% select(-json_col))

print(new.df)
# # A tibble: 4 x 7
#   foo_a foo_b foo_c nested_col.foo_d nested_col.foo_e    id  startdate
#   <chr> <chr> <int>            <chr>            <int> <dbl>     <date>
# 1     _  <NA>    NA             <NA>               NA   101 2010-11-01
# 2     _     _    NA             <NA>               NA   102 2008-03-25
# 3     _  <NA>     2             <NA>               NA   103 2007-03-14
# 4     _     _     2                _                3   104 2006-02-21
cmaher
  • 5,100
  • 1
  • 22
  • 34
  • 1
    Thanks for the solution, it looks great. can I ask why you need to `ungroup()` when there is no `group_by()` preceding? – pd441 Apr 04 '18 at 07:55
  • 1
    The reason is that `rowwise()` essentially groups the data frame by the row number- it behaves similarly to if you had created a row number column and then grouped the data by that column. The use of `ungroup()` removes the rowwise grouping. – bschneidr Aug 09 '21 at 19:45
2
library(dplyr)
library(tidyr)
library(purrr)
library(jsonlite)

sample.df %>%
  mutate(
    json_parsed = map(json_col, ~ fromJSON(., flatten=TRUE))
  ) %>%
  unnest(json_parsed)

#    id
# 1 101
# 2 102
# 3 103
# 4 104
#                                                                           json_col
# 1                                                                [{"foo_a":"bar"}]
# 2                                                  [{"foo_a":"bar","foo_b":"bar"}]
# 3                                                      [{"foo_a":"bar","foo_c":2}]
# 4 [{"foo_a":"bar","foo_b":"bar","foo_c":2,"nested_col":{"foo_d":"bar","foo_e":3}}]
#    startdate foo_a foo_b foo_c nested_col.foo_d nested_col.foo_e
# 1 2010-11-01   bar  <NA>    NA             <NA>               NA
# 2 2008-03-25   bar   bar    NA             <NA>               NA
# 3 2007-03-14   bar  <NA>     2             <NA>               NA
# 4 2006-02-21   bar   bar     2              bar                3

If you are reducing libraries, you can remove purrr and instead use:

...
    json_parsed = lapply(.$json_col, fromJSON, flatten=TRUE)
...
r2evans
  • 141,215
  • 6
  • 77
  • 149
1

I think this will work. The main idea is that we take json_col and turn it into a character string, that we can then pass into the fromJSON function that takes care of the rest.

library(stringi)
library(jsonlite)

sample.df$json_col<- as.character(sample.df$json_col)
json_obj<- paste(sample.df$json_col, collapse = "")
json_obj<- stri_replace_all_fixed(json_obj, "][", ",")
new.df<- cbind(sample.df$id, fromJSON(json_obj), sample.df$startdate)

> new.df
#  sample.df$id foo_a foo_b foo_c nested_col.foo_d nested_col.foo_e
#1          101     _  <NA>    NA             <NA>               NA
#2          102     _     _    NA             <NA>               NA
#3          103     _  <NA>     2             <NA>               NA
#4          104     _     _     2                _                3
#  sample.df$startdate
#1          2010-11-01
#2          2008-03-25
#3          2007-03-14
#4          2006-02-21

Make sure that the cbind part work correctly! In this case it did, but make sure that in your overall manipulations, you don't change the order of things.

Yannis Vassiliadis
  • 1,719
  • 8
  • 14