I need to build a data frame from JSON or XML files (the data is available in both formats here). However, I get errors when I try to read either format into R.

With the JSON file, the error is:

Error in parse_con(txt, bigint_as_char) : lexical error: invalid bytes in UTF8 string. stion":"0","name_question":"Óðî÷èñòå çàñ³äàííÿ Âåðõîâíî¿ Ðàä (right here) ------^

With the XML file, the error is:

Error in [<-.data.frame(*tmp*, i, names(nodes[[i]]), value = c(date_agenda = "27112014", : duplicate subscripts for columns

The commands I use are

library(jsonlite)
library(XML)

k <- fromJSON("https://data.rada.gov.ua/ogd/zal/ppz/skl8/dict/agendas_8_skl.json", encoding = "UTF-8")
m <- xmlToDataFrame("agendas_8_skl.xml") 

Prior to executing the commands, I download files to the working directory.

I do not understand how to get the data into a data frame. Please help!

A. Suliman
Yuliia Zhaha

2 Answers

This answer is based on @user2554330's answer here.

library(jsonlite)
library(RCurl) 
# In case your locale is not Ukrainian
Sys.setlocale("LC_CTYPE", "ukrainian")
k <- fromJSON(getURL("https://data.rada.gov.ua/ogd/zal/ppz/skl8/dict/agendas_8_skl.json", 
                     .encoding = "ISO-8859-5"))

# convert k into a data frame using tidyr::unnest
library(dplyr)
library(tidyr)
df <- tibble(date_agenda=k[[1]]$date_agenda, question=k[[1]]$question) %>% 
        unnest(question) %>% 
        unnest(reporter_question, keep_empty=TRUE) 
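
The "invalid bytes in UTF8 string" error in the question is what you would expect when a file's bytes are in a legacy Cyrillic encoding but get parsed as UTF-8. Here is a minimal offline sketch of that idea. It assumes Windows-1251 (`CP1251`) purely for illustration; the sample JSON string and the encoding name are assumptions, not something confirmed for the rada.gov.ua files, so swap in whatever encoding the file really uses:

```r
library(jsonlite)

# Hypothetical sample: a tiny JSON document with a Cyrillic value
# (written with \u escapes so the script itself stays ASCII)
utf8_json <- '{"name_question":"\u0423\u0440\u043e\u0447\u0438\u0441\u0442\u0435"}'

# Simulate the file on disk: the same text stored as CP1251 bytes
raw_cp1251 <- iconv(utf8_json, from = "UTF-8", to = "CP1251", toRaw = TRUE)[[1]]

# Read as-is, those bytes are not valid UTF-8, which is what the parser rejects
is_valid_before <- validUTF8(rawToChar(raw_cp1251))

# Re-encode the bytes to UTF-8 first, then parse
fixed_json <- iconv(list(raw_cp1251), from = "CP1251", to = "UTF-8")
parsed <- fromJSON(fixed_json)
```

With a downloaded file, the same conversion could be applied to the bytes returned by `readBin()` before calling `fromJSON()`.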
A. Suliman
  • It gives me a list. Is it possible to make a data frame out of it? – Yuliia Zhaha Feb 24 '20 at 09:36
  • I have a problem with a different JSON file from the same site. Here is the command I use: ```Sys.setlocale("LC_CTYPE", "ukrainian")``` ```mps8 <- fromJSON(getURL("https://data.rada.gov.ua/ogd/mps/skl8/mps-data.json", .encoding = "ISO-8859-5"))``` It gives me an error. Do you know what might cause the problem? – Yuliia Zhaha Feb 26 '20 at 20:54
  • @YuliiaZhaha I tested the URL [here](https://jsonformatter.curiousconcept.com/) and I think the JSON file is corrupted. I tried `readLines` as mentioned [here](https://stackoverflow.com/questions/30251576/reading-a-non-standard-csv-file-into-r) and got the data, but in a weird format, probably due to encoding. I think you should start a new question. – A. Suliman Feb 29 '20 at 04:53
  • Using `readLines` with `encoding = "UTF-16LE"` on a Linux machine works nicely. I've added a small R script containing the full solution in RStudio Cloud [here](https://rstudio.cloud/project/996195). The project is public; you just need an account to log in. – A. Suliman Feb 29 '20 at 15:15

Here is a solution working with the xml data.

See the code comments for details:

library(xml2)
library(dplyr)

# read the XML document
page <- read_xml("https://data.rada.gov.ua/ogd/zal/ppz/skl8/dict/agendas_8_skl.xml")

# obtain a list of parent nodes
agendas <- xml_find_all(page, "agenda")

output <- lapply(agendas, function(agenda) {
  # get the date
  date <- agenda %>% xml_find_first(".//date_agenda") %>% xml_text() %>% as.Date(format = "%d%m%Y")
  # pull the question id from the attribute
  question_id <- agenda %>% xml_find_all(".//question") %>% xml_attr("id_question")
  # obtain the information from all of the nodes (assumes an equal number of each)
  number_questions <- agenda %>% xml_find_all(".//number_question") %>% xml_text()
  init_questions <- agenda %>% xml_find_all(".//init_question") %>% xml_text()
  name_questions <- agenda %>% xml_find_all(".//name_question") %>% xml_text()

  # create a data frame of the answer (long format)
  data.frame(date, question_id, number_questions, init_questions, name_questions, stringsAsFactors = FALSE)
})

# bind into one large, long-format data frame
finalanswer <- bind_rows(output)
head(finalanswer)
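
The code above assumes every agenda has the same number of each child node. If that does not hold, `data.frame()` can fail or misalign values. A hedged variant is to iterate over the questions instead, so that a missing child simply becomes `NA`. The inline XML below is a hypothetical sample that only mimics the layout implied by the code above; it is not the real file:

```r
library(xml2)

# Hypothetical sample data; element names follow the answer's XPath queries
sample_xml <- '<agendas>
  <agenda>
    <date_agenda>27112014</date_agenda>
    <question id_question="1">
      <number_question>0</number_question>
      <name_question>First</name_question>
    </question>
    <question id_question="2">
      <number_question>1</number_question>
      <init_question>Someone</init_question>
      <name_question>Second</name_question>
    </question>
  </agenda>
</agendas>'

page <- read_xml(sample_xml)
questions <- xml_find_all(page, ".//question")

# One row per question; xml_find_first() on a missing child yields NA text
rows <- lapply(questions, function(q) {
  data.frame(
    date = as.Date(xml_text(xml_find_first(q, "ancestor::agenda/date_agenda")),
                   format = "%d%m%Y"),
    question_id = xml_attr(q, "id_question"),
    number_question = xml_text(xml_find_first(q, "./number_question")),
    init_question = xml_text(xml_find_first(q, "./init_question")),
    name_question = xml_text(xml_find_first(q, "./name_question")),
    stringsAsFactors = FALSE
  )
})

result <- do.call(rbind, rows)
```

Here the first question has no `init_question`, and the row is still built with `NA` in that column.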
Dave2e
  • It gives me an error on the command ```#obtain a list of parent nodes agendas<-xml_nodes(page, "agenda")``` – Yuliia Zhaha Feb 24 '20 at 09:35
  • @YuliiaZhaha, sorry, I updated the code to swap `xml_nodes` for the new `xml_find_all` function. The `xml_node` function is in the rvest package. – Dave2e Feb 24 '20 at 14:06