I have a very specific problem that I don't know how to deal with, and I haven't been able to find any suggestions online, so I would appreciate help!
I want to extract the nodes and content of multiple .xml files and create a data frame in R. I did this with the code below and it ran without problems. However, when I went back to check whether all the content was there for further analysis, I noticed that the text of the .xml files is not displayed as it should be: it is cut off after a few lines and then continues in the last paragraph. The only hint that something has been left out is a "..." at the point of the cutoff. I have no idea what I did wrong and, most importantly, how to get it right. There are not multiple text nodes; I have already checked.
Is there a command I can use to solve this? I have this problem for all the .xml files.
```
library(xml2)
library(dplyr)

path <- "C:/Users/Doctoral Researcher/Desktop/Arbeit/Projektbezogen/DATA/Data_questions/Deutschland/Split_drs19-data/Schriftliche Fragen"
setwd(path)

# Match only names ending in ".xml"; full.names makes the paths independent of the working directory
files <- list.files(path, pattern = "\\.xml$", recursive = TRUE, full.names = TRUE)

# Read each file and collect the nodes of interest into one data frame per file
dataframe_writtenquestions <- lapply(files, function(file) {
  page <- read_xml(file)
  electoralperiod <- xml_find_all(page, ".//WAHLPERIODE") %>% xml_text()
  typeofdocument <- xml_find_all(page, ".//DOKUMENTART") %>% xml_text()
  typeofquestions <- xml_find_all(page, ".//DRS_TYP") %>% xml_text()
  number <- xml_find_all(page, ".//NR") %>% xml_text()
  date <- xml_find_all(page, ".//DATUM") %>% xml_text()
  title <- xml_find_all(page, ".//TITEL") %>% xml_text()
  txt <- xml_find_all(page, ".//TEXT") %>% xml_text()
  data.frame(electoralperiod, typeofdocument, typeofquestions, number, date, title, txt)
})

# Stack the per-file data frames into a single data frame
df <- bind_rows(dataframe_writtenquestions)
```
This ran without errors. I then checked the first few .xml files and used "inspect element" to see what was happening, and saw that most of the content was missing.
```
# Draw three random rows and print them to inspect the extracted text
dfs <- sample_n(df, 3)
print(dfs)
```
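
One way to narrow this down is to check whether the text is really missing from the data frame or only shortened when it is displayed. The following is a minimal sketch under assumptions: the small XML document, `long_body`, and `df_check` are hypothetical stand-ins for one of the real files and for `df`, and it is not confirmed that display shortening is the cause here.

```r
library(xml2)

# Hypothetical stand-in for one file: a TEXT node with a long body.
long_body <- paste(rep("Lorem ipsum dolor sit amet.", 200), collapse = " ")
doc <- read_xml(paste0("<DOKUMENT><NR>19/1</NR><TEXT>", long_body, "</TEXT></DOKUMENT>"))

# Extract the nodes the same way as in the question, for a single document.
df_check <- data.frame(
  number = xml_text(xml_find_first(doc, ".//NR")),
  txt = xml_text(xml_find_first(doc, ".//TEXT")),
  stringsAsFactors = FALSE
)

# Compare the number of characters stored with the number in the source text.
stored_length <- nchar(df_check$txt[1])
source_length <- nchar(long_body)
lengths_match <- identical(stored_length, source_length)
print(lengths_match)

# Check whether a literal "..." is part of the stored string itself.
has_ellipsis <- grepl("...", df_check$txt[1], fixed = TRUE)
print(has_ellipsis)

# Write the complete string to the console, bypassing data frame printing.
writeLines(df_check$txt[1])
```

Applied to the real data, `nchar(df$txt)` and `writeLines(df$txt[1])` would show whether the full text is stored. If the lengths look right and no literal "..." is found in the string, the cutoff is probably happening in the display rather than in the extraction.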