How to unpack nested JSON type column in a dataframe with R(plus RegEx issue)

Question

I'm very new to R,and I'm currently stuck on this problem: so I imported a JSON file and already***convert it to a dataframe***, now I need to return rows under condition:

As you can see in the picture, I have a column recording hours(payload.hours) My GOAL is to find out the hours that meet: 1. sunday 2. time ealier than 10AM.

I tried several ways but somehow it even doesn't come close at all... I havent dealt with such nested form before... so I have to seek your idea&help...

e.g. one element in payload.hours column

payload.hours ...

[530] "{\"monday\":[[\"10:30\",\"16:00\"]],\"tuesday\":[[\"10:30\",\"16:00\"]],\"wednesday\":[[\"10:30\",\"16:00\"]],\"thursday\":[[\"10:30\",\"16:00\"]],\"friday\":[[\"10:30\",\"16:00\"]],\"saturday\":[[\"10:30\",\"16:00\"]],\"sunday\":[[\"10:30\",\"16:00\"]]}"

this is what I used for unpacking the nested lists in "hours" column...but it doesn't work...

  library(ndjson)
  json<- ndjson::stream_in("#localpath")  
  #successfully converted json to a dataframe...but each element in payload.hours column remains nested like above.

  lapply(json$payload.hours, jsonlite::fromJSON)
  #continue unwarp nested jason column BUT RESULT  Error in if (is.character(txt) && length(txt) == 1 && nchar(txt, type = "bytes") <  :missing value where TRUE/FALSE needed

Another approach I tried(FOR A LONG TIME) is RegEx

hrs<-json1$payload.hours         #select column hours into hrs
tme<-"sunday{1}.{8}[0-9]{1}\""   # ???(not sure about this...seruously)...?  match string with sunday and after 8characters..aka find preceding digit{1} when meet ":" 
iftme<-grepl(tme,hrs)            #set logical factor T/F if matches 
checkhrs<-hrs[iftme]             #check if open hours are correct
checkhrs

And this seems to work...but I am not sure why...(YES.IDK WHY)...so if anyone could explain to me that would be great!

This is the original json file:

https://drive.google.com/open?id=0B-jU6pp4pjS4Smg2RGpHSTlvN2c

This is RegEx output...seems right...but I am not sure about my expression..LOL

You haven't fully converted from JSON yet. You should `lapply` `fromJSON` across that column, which will give you a list column that you can unpack as makes sense. Don't use regex. — alistaire, Jan 16 '17 at 02:10
Hi there, I used lapply(json_file$payload.hours, data.frame, stringsAsFactors = FALSE) to unpack the column...but it didn't work.. could you please offer a way? — leveygao, Jan 16 '17 at 04:32
Why are you calling `data.frame`? You need to read the JSON, not make more nested data.frames. If you want an actual answer, edit to make your example [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example#5963610). — alistaire, Jan 16 '17 at 04:34
I have already converted my json file into a dataframe...but this column remains nested like pic shows....so I think now I need only unnest this one and do my analysis...expect your help! — leveygao, Jan 16 '17 at 12:09
No, you've _partially_ converted your JSON file into a data.frame. The bits you're calling "nested" are still JSON, just stored as character strings in a data.frame column. You need some variant of `lapply(df$payload.hours, jsonlite::fromJSON)`, as I told you in my first comment. — alistaire, Jan 16 '17 at 19:33
Hi ali...by using your code...it suggests: Error in if (is.character(txt) && length(txt) == 1 && nchar(txt, type = "bytes") < : missing value where TRUE/FALSE needed ...do you have any idea? Moreover..I tried regex to solve this..result attached....but I am not sure why my expression works?(or not ?) can you take a look? — leveygao, Jan 17 '17 at 01:36
I'm sorry, but if you're not going to make a reproducible example (see the link above), I'm not going to waste my time trying to figure out what's happening. — alistaire, Jan 17 '17 at 01:38
I'm so sorry...I attached the original file(should done this earlier)...and I showed my code to read json(very simple stream_in from ndjson)...again that hours column doesn't get unpack; and I also showed my way&result for regex. Thank you! — leveygao, Jan 17 '17 at 01:49

score 1 · Accepted Answer · answered Jan 17 '17 at 04:02

Unpacking JSON can be a lot of work, particularly if it is deeply nested. Most JSON reading packages (jsonlite, RJSONIO, etc.) can turn data into something close to a data.frame, but fixing the structure requires an understanding that the reader functions don't have. Since JSON most nearly corresponds to R's lists, cleaning up data coming from JSON typically involves a lot of lapply and its variants. Here I'll use purrr, which has many useful variants and helper functions and works neatly with dplyr.

library(tidyverse)

# Read data
json <- jsonlite::stream_in(file('~/Downloads/jsondata.json'))

        # Initial cleanup to proper data.frame
json <- json$payload %>% map_df(simplify_all) %>% dmap(simplify) %>% 
    mutate(uuid = json$uuid,    # re-add uuid subset out at beginning
           # Convert hours to a list column of data.frames
           hours = hours %>% map_if(negate(is.na), jsonlite::fromJSON) %>%
               map(~map_df(.x, as_data_frame, .id = 'day')), 
           # Add Boolean variable for whether Sunday opening hours are before 10a. Subset,
           open_sun_before_10 = hours %>% map(~.x %>% filter(day == 'sunday') %>% .[[2]]) %>%
               map(as.POSIXct, format = '%H:%M') %>%    # convert to datetime,
               map(~.x < as.POSIXct('10:00', format = '%H:%M')) %>%    # compare to 10a
               map_lgl(~ifelse(length(.x) == 0, NA, .x)))    # and cleanup.

Whereas stream_in returned a data.frame with two columns (one very deeply nested), the columns are now less nested. There are still JSON structures in some of the untouched columns, though, which will have to be addressed if you want to use the data.

json
#> # A tibble: 538 × 42
#>    existence_full                               geo_virtual  latitude
#>             <dbl>                                     <chr>     <chr>
#> 1        1.000000 ["56.9459720|-2.1971226|20|within_50m|4"] 56.945972
#> 2        1.000000   ["56.237480|-5.073578|20|within_50m|4"] 56.237480
#> 3        1.000000     ["51.483872|-0.606820|100|rooftop|2"] 51.483872
#> 4        1.000000     ["57.343233|-2.191955|100|rooftop|4"] 57.343233
#> 5        1.000000   ["53.225815|-4.094775|20|within_50m|4"] 53.225815
#> 6        1.000000 ["58.9965740|-3.1882195|20|within_50m|4"] 58.996574
#> 7        1.000000     ["57.661419|-2.520144|100|rooftop|4"] 57.661419
#> 8        1.000000   ["51.642727|-3.934845|20|within_50m|4"] 51.642727
#> 9        0.908251                                      <NA>      <NA>
#> 10       1.000000     ["56.510558|-5.401638|100|rooftop|2"] 56.510558
#> # ... with 528 more rows, and 39 more variables: locality <chr>,
#> #   `_records_touched` <chr>, address <chr>, email <chr>,
#> #   existence_ml <dbl>, domain_aggregate <chr>, name <chr>,
#> #   search_tags <list>, admin_region <chr>, existence <dbl>,
#> #   category_labels <list>, post_town <chr>, region <chr>,
#> #   review_count <chr>, geocode_level <chr>, tel <chr>, placerank <int>,
#> #   longitude <chr>, placerank_ml <dbl>, fax <chr>,
#> #   category_ids_text_search <chr>, website <chr>, status <chr>,
#> #   geocode_confidence <chr>, postcode <chr>, category_ids <list>,
#> #   country <chr>, `_geocode_quality` <chr>, hours_display <chr>,
#> #   hours <list>, neighborhood <list>, languages <chr>,
#> #   address_extended <chr>, status_closed <chr>, po_box <chr>,
#> #   name_variants <list>, yext_id <chr>, uuid <chr>,
#> #   open_sun_before_10 <lgl>

And the columns created:

json %>% select(hours, open_sun_before_10)
#> # A tibble: 538 × 2
#>               hours open_sun_before_10
#>              <list>              <lgl>
#> 1  <tibble [1 × 2]>                 NA
#> 2  <tibble [1 × 2]>                 NA
#> 3  <tibble [7 × 3]>              FALSE
#> 4  <tibble [1 × 2]>                 NA
#> 5  <tibble [7 × 3]>              FALSE
#> 6  <tibble [1 × 2]>                 NA
#> 7  <tibble [1 × 2]>                 NA
#> 8  <tibble [6 × 3]>                 NA
#> 9  <tibble [1 × 2]>                 NA
#> 10 <tibble [7 × 3]>               TRUE
#> # ... with 528 more rows

Thank you so much... I will spend some time digesting this kind of technique! — leveygao, Jan 17 '17 at 07:14

How to unpack nested JSON type column in a dataframe with R(plus RegEx issue)

1 Answers1