I am attempting to build a dataset from unstructured data. The raw data is a series of json files each of which contains information about a single element of the data (eg. each file becomes a row in the final data). I am looping through the jsons using jsonlite to turn each into a huge nested list. This whole operation is falling apart on the basis of a seemingly simple problem:
I need to append rows to my data where some of the elements do not exist.
The raw data is like so:
jsn1 <- list(id='id1', col1=1, col2='A', col3=11)
jsn2 <- list(id='id2', col1=2, col2='B', col3=12)
jsn3 <- list(id='id3', col2='C', col3=13)
jsn4 <- list(id='id4', col1=3, col3=14)
The structure I am trying to get to is this:
df <- data.frame(id=c('id1','id2','id3','id4'),
col1=c(1,2,NA,4),
col2=c('A','B','C',NA),
col3=c(11,12,13,14))
> df
id col1 col2 col3
1 id1 1 A 11
2 id2 2 B 12
3 id3 NA C 13
4 id4 4 <NA> 14
My approach is along the lines of:
#Collect the json names in a vector
files=c('jsn1','jsn2','jsn3','jsn4')
#Initialize the dataframe with the first row filling in any missing values.
#I didn't do this at first, but it seems helpful.
df1=data.frame(id=jsn1$id,
col1=jsn1$col1,
col2=jsn1$col2,
col3=jsn1$col3,
stringsAsFactors=F)
#Create a loop to loop through the files extracting the values then add them to a dataframe.
for (i in 2:length(files)) {
a <- get(files[i])
new.row <- list(id=a$id,
col1=a$col1,
col2=a$col2,
col3=a$col3)
df1 <- rbind(df1,b)
}
However, this doesn't work because df1 <- rbind(df1,new.row)
requires the columns to be the same length. I have tried df1 <- rbind.fill(df1,new.row)
, rbindlist(list(df1,new.row),use.names=T,fill=T)
, and df[nrow(df1) +1,names(new.row)] <- new.row
. And read this and this among others.
Most answers can add to the data frame by "knowing" a priori what columns will be null /not null. Then constructing a df without those columns and adding it with fill. This won't work as I have no idea what columns will be present ahead of time. The missing ones currently end up with 0 elements which is the root of the problem, but I need to check if they are present. It seems like there should be an easy way to handle this either "on read" or on the rbind, but I can't seem to figure it out.
There are potentially hundreds of columns and millions of rows (though 10s and 100s right now). The jsons are large so reading them all into memory / contacting the lists somehow is probably not possible with the real data. A solution using data.table would probably be ideal. But any help is appreciated. Thanks.