My data is drawn from a JSON API. The structure is as follows:
- The documents are structured as a list, with one entry per document.
- Each entry is itself a list containing the docvars; some of these docvars are themselves lists.
- The number of docvars is not consistent: it ranges between 36 and 49, so not every entry has every docvar.
- The position of the docvars is not consistent either. For example, docu[1][4] is sometimes 'date' and other times 'source'.
I would like to unnest these lists and create a data frame in which each document is a row and each docvar is a column. Missing docvars should be NA.
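To make the structure concrete, here is a toy version of it, followed by the shape of the result I am after, written out by hand. The field names and values are invented for illustration and are not the real API response. Collapsing a nested docvar into a single string ("A; B") is just one assumed way of fitting it into a cell.

```r
# Toy stand-in for data$documents; names and values are invented.
docs <- list(
  D1 = list(docdt = "2000-12-31",
            lang  = "English",
            authr = list(`0` = list(author = "A"), `1` = list(author = "B"))),
  D2 = list(lang  = "French",
            docdt = "1999-05-01",
            repnb = "12345",
            docty = "Report")
)

# The number of docvars differs between documents ...
n_fields <- lengths(docs)
# ... and so does the docvar found at a given position.
first_field <- sapply(docs, function(d) names(d)[1])

# Desired result, written by hand: one row per document,
# one column per docvar, NA where a docvar is missing.
wanted <- data.frame(
  docdt = c("2000-12-31", "1999-05-01"),
  lang  = c("English", "French"),
  authr = c("A; B", NA),
  repnb = c(NA, "12345"),
  docty = c(NA, "Report"),
  row.names = c("D1", "D2"),
  stringsAsFactors = FALSE
)
```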
library(rjson)
data = rjson::fromJSON(file="http://search.worldbank.org/api/v2/wds?format=json&fl=abstracts,admreg,alt_title,authr,available_in,bdmdt,chronical_docm_id,closedt,colti,count,credit_no,disclosure_date,disclosure_type,disclosure_type_date,disclstat,display_title,docdt,docm_id,docna,docty,dois,entityid,envcat,geo_reg,geo_reg,geo_reg_and_mdk,guid,historic_topic,id,isbn,ispublicdocs,issn,keywd,lang,listing_relative_url,lndinstr,loan_no,majdocty,majtheme,ml_abstract,ml_display_title,new_url,owner,pdfurl,prdln,projectid,projn,publishtoextweb_dt,repnb,repnme,seccl,sectr,src_cit,subsc,subtopic,teratopic,theme,topic,topicv3,totvolnb,trustfund,txturl,unregnbr,url_friendly_title,versiontyp,versiontyp_key,virt_coll,vol_title,volnb&str_docdt=1986-01-01&end_docdt=2000-12-31&rows=500&os=1&srt=docdt&order=desc")
There are a lot of questions like this; however, none of the solutions seem to work in this case. For example:
Unnesting a list of lists in a data frame column
library(tidyverse)
tidy <- data$documents %>% bind_rows(data$documents) %>% # make larger sample data
  mutate_if(is.list, simplify_all) %>% # flatten each list element internally
  unnest() # expand
Error in bind_rows_(x, .id) : Argument 36 must be length 1, not 2
Unnest one of several list columns in dataframe
Convert list of lists to dataframe
R convert list of lists to dataframe
R: How to extract a list from a dataframe?
Extracting data.frames from a list using for loop
R, dpylr: Converting list of lists of differing lenghts within dataframe into long format dataframe
This last one comes close, but I have multiple docvars, and for many of them I do not know the names.
Another attempt of mine used a loop:
df <- data.frame()
df_s <- data.frame()
s=0
#Desired API
for(l in 1:100){
  print(l)
  s=s+500 # offset for the next page of 500 rows
  url <- paste0("http://search.worldbank.org/api/v2/wds?format=json&fl=abstracts,admreg,alt_title,authr,available_in,bdmdt,chronical_docm_id,closedt,colti,count,credit_no,disclosure_date,disclosure_type,disclosure_type_date,disclstat,display_title,docdt,docm_id,docna,docty,dois,entityid,envcat,geo_reg,geo_reg,geo_reg_and_mdk,guid,historic_topic,id,isbn,ispublicdocs,issn,keywd,lang,listing_relative_url,lndinstr,loan_no,majdocty,majtheme,ml_abstract,ml_display_title,new_url,owner,pdfurl,prdln,projectid,projn,publishtoextweb_dt,repnb,repnme,seccl,sectr,src_cit,subsc,subtopic,teratopic,theme,topic,topicv3,totvolnb,trustfund,txturl,unregnbr,url_friendly_title,versiontyp,versiontyp_key,virt_coll,vol_title,volnb&str_docdt=1986-01-01&end_docdt=2000-12-31&rows=500&os=",s,"&srt=docdt&order=desc")
  WBeLib_content = rjson::fromJSON(file= url)
  stop <- WBeLib_content$rows
  #df <- data.frame()
  # copy selected docvars of each document into one row of df
  for(i in 1:500 ){
    docu <- WBeLib_content$documents[i]
    df[i,1] <- docu[[1]]$url
    df[i,2] <- docu[[1]]$txturl
    df[i,3] <- docu[[1]]$docdt
    df[i,4] <- docu[[1]]$disclstat
    df[i,5] <- docu[[1]]$disclosure_date
    df[i,6] <- docu[[1]]$versiontyp
    df[i,7] <- docu[[1]]$docty
    df[i,8] <- docu[[1]]$subtopic
    df[i,9] <- docu[[1]]$count
    df[i,10] <- docu[[1]]$colti
    df[i,11] <- docu[[1]]$historic_topic
    df[i,12] <- docu[[1]]$seccl
    df[i,13] <- docu[[1]]$lang
    df[i,14] <- docu[[1]]$majdocty
    df[i,15] <- docu[[1]]$owner
    df[i,16] <- docu[[1]]$guid
    df[i,17] <- docu[[1]]$repnb
    df[i,18] <- docu[[1]]$admreg
    df[i,19] <- docu[[1]]$pdfurl
    df[i,20] <- docu[[1]]$docm_id
  }
  # append this page to the accumulated result
  if(i>1){ df_s <- rbind(df,df_s) } else { df_s <- df}
}
However, because not every docvar is present in every document, selecting by name ends up out of bounds. Selecting by position works, but then the columns are no longer in a consistent order.
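The problem with selecting by name can be reproduced on a toy record. The field names are borrowed from the loop above, the values are invented, and this is only a small reproduction, not the real API data. A docvar that is absent comes back as NULL, and assigning NULL into a data frame cell raises an error.

```r
# Toy record that has 'url' and 'docdt' but lacks 'txturl'; values are invented.
docu <- list(list(url = "http://example.org/doc1", docdt = "2000-12-31"))

df <- data.frame()
df[1, 1] <- docu[[1]]$url              # works: the field exists

# A missing field does not fail on lookup; it is simply NULL.
missing_is_null <- is.null(docu[[1]]$txturl)

# Assigning that NULL into a cell is what fails.
# tryCatch records the outcome instead of aborting the script.
res <- tryCatch({
  df[1, 2] <- docu[[1]]$txturl
  "ok"
}, error = function(e) "error")
```

In the real loop this happens as soon as one of the 20 selected docvars is absent from a document.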