Parsing unstructured data - Append rows with non-existant values rows to a dataset

Question

I am attempting to build a dataset from unstructured data. The raw data is a series of json files each of which contains information about a single element of the data (eg. each file becomes a row in the final data). I am looping through the jsons using jsonlite to turn each into a huge nested list. This whole operation is falling apart on the basis of a seemingly simple problem:

I need to append rows to my data where some of the elements do not exist.

The raw data is like so:

jsn1 <- list(id='id1', col1=1, col2='A', col3=11)
jsn2 <- list(id='id2', col1=2, col2='B', col3=12)
jsn3 <- list(id='id3', col2='C', col3=13)
jsn4 <- list(id='id4', col1=3, col3=14)

The structure I am trying to get to is this:

df <- data.frame(id=c('id1','id2','id3','id4'),
                 col1=c(1,2,NA,4),
                 col2=c('A','B','C',NA),
                 col3=c(11,12,13,14))

> df
   id col1 col2 col3
1 id1    1    A   11
2 id2    2    B   12
3 id3   NA    C   13
4 id4    4 <NA>   14

My approach is along the lines of:

#Collect the json names in a vector
files=c('jsn1','jsn2','jsn3','jsn4')

#Initialize the dataframe with the first row filling in any missing values.  
#I didn't do this at first, but it seems helpful.
df1=data.frame(id=jsn1$id,
     col1=jsn1$col1,
     col2=jsn1$col2,
     col3=jsn1$col3,
     stringsAsFactors=F)

#Create a loop to loop through the files extracting the values then add them to a dataframe.
for (i in 2:length(files)) {
    a <- get(files[i])
    new.row <- list(id=a$id,
              col1=a$col1,
              col2=a$col2,
              col3=a$col3) 
    df1 <- rbind(df1,b)
}

However, this doesn't work because df1 <- rbind(df1,new.row) requires the columns to be the same length. I have tried df1 <- rbind.fill(df1,new.row), rbindlist(list(df1,new.row),use.names=T,fill=T), and df[nrow(df1) +1,names(new.row)] <- new.row. And read this and this among others.

Most answers can add to the data frame by "knowing" a priori what columns will be null /not null. Then constructing a df without those columns and adding it with fill. This won't work as I have no idea what columns will be present ahead of time. The missing ones currently end up with 0 elements which is the root of the problem, but I need to check if they are present. It seems like there should be an easy way to handle this either "on read" or on the rbind, but I can't seem to figure it out.

There are potentially hundreds of columns and millions of rows (though 10s and 100s right now). The jsons are large so reading them all into memory / contacting the lists somehow is probably not possible with the real data. A solution using data.table would probably be ideal. But any help is appreciated. Thanks.

Rich Scriven · Answer 1 · 2015-09-28T22:00:25.080

You could do

data.table::rbindlist(mget(ls(pattern = "jsn[1-4]")), fill = TRUE)
#     id col1 col2 col3
# 1: id1    1    A   11
# 2: id2    2    B   12
# 3: id3   NA    C   13
# 4: id4    3   NA   14

Here mget(ls(pattern = "jsn[1-4]")) is a more programmatic way to gather the lists from the global environment that match the pattern jsn followed by the numbers 1-4. It's just the same as list(jsn1, jsn2, jsn3, jsn4) except it comes with names. You could just as easily do

rbindlist(list(jsn1, jsn2, jsn3, jsn4), fill = TRUE)

The ls() method will be better if you have many more jsn* lists.

score 1 · Answer 2 · answered Sep 28 '15 at 21:51

You want rbind.fill from plyr. First you have to convert all your lists to dataframes (here using lapply(mylists, as.data.frame)), then you can use rbind.fill to bind them and fill missing rows with NA:

library(plyr)
rbind.fill(lapply(list(jsn1, jsn2, jsn3, jsn4), as.data.frame))

   id col1 col2 col3
1 id1    1    A   11
2 id2    2    B   12
3 id3   NA    C   13
4 id4    3 <NA>   14

score 0 · Answer 3 · edited May 23 '17 at 12:29

I am going to jump back in because I have figured this out and want to share in case anyone else comes across this problem. The problem was reading values that are not present in the loop was creating nulls in my list that prevent rbindlist and rbind.fill from working.

To be more specific, I was doing this:

new.row <- list(id=a$id,
              col1=a$col1,
              col2=a$col2,
              col3=a$col3)

when a$col2 was not in the list read from the json. This causes new.row$col2 to be NULL and then you cannot use new.row in rbindlist or rbind.fill. However, all that I needed to do was remove these nulls from the list like so

plyr::compact(new.row)

Source

before then using rbindlist. Both answers were helpful by showing me that rbindlist or rbind.fill would work without the null values.

Parsing unstructured data - Append rows with non-existant values rows to a dataset

3 Answers3