Text from .xml data is missing, even though there is a node "TEXT" defined and I have extracted a dataframe - what can I do in R?

Question

I have a very specific problem and I don't know how to deal with it. I have not been able to find any suggestions online either, so I would appreciate help!

I want to extract the nodes and content of multiple .xml files and create a data frame in R. I did so with the code below, and it ran without problems. However, when I went back to check whether all the content was there for further analysis, I noticed that the text of the .xml files is not displayed in full: it is cut off after a few rows and then continues in the last paragraph. The only hint that something has been left out is a "..." at the point of the cutoff. I have no idea what I did wrong or, more importantly, how to get it right. I have already checked that there are not multiple text nodes.

Is there any command I can use that solves this? I have this problem for all the .xmls.

```r
library(xml2)
library(dplyr)

path <- "C:/Users/Doctoral Researcher/Desktop/Arbeit/Projektbezogen/DATA/Data_questions/Deutschland/Split_drs19-data/Schriftliche Fragen"
setwd(path)

# Collect every .xml file below `path`; the pattern is anchored so only the extension matches
files <- list.files(path, pattern = "\\.xml$", recursive = TRUE, full.names = TRUE)

# Build one data frame per file from the nodes of interest
dataframe_writtenquestions <- lapply(files, function(file) {
  page <- read_xml(file)
  electoralperiod <- xml_find_all(page, ".//WAHLPERIODE") %>% xml_text()
  typeofdocument <- xml_find_all(page, ".//DOKUMENTART") %>% xml_text()
  typeofquestions <- xml_find_all(page, ".//DRS_TYP") %>% xml_text()
  number <- xml_find_all(page, ".//NR") %>% xml_text()
  date <- xml_find_all(page, ".//DATUM") %>% xml_text()
  title <- xml_find_all(page, ".//TITEL") %>% xml_text()
  txt <- xml_find_all(page, ".//TEXT") %>% xml_text()
  data.frame(electoralperiod, typeofdocument, typeofquestions, number, date, title, txt)
})

# Combine the per-file data frames into one
df <- bind_rows(dataframe_writtenquestions)
```

This worked fine, so I checked the first .xml files and used "inspect element" to see what was happening, and then I saw that most of the content was missing.

```r
# Draw three random rows and print them to inspect the extracted text
dfs <- sample_n(df, 3)
print(dfs)
```
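One way to tell whether the text is really missing from the data frame, or only abbreviated by whatever displays it, might be to look at the stored character count and write a single cell out in full. The sketch below uses a made-up data frame (`demo`, `long_text` and `out_file` are placeholders, not the real data); with the real data the same calls would be applied to `df$txt`. It is only a diagnostic under these assumptions and does not establish the cause of the cutoff.

```r
# Hypothetical stand-in for one row of `df`; the real text would come from xml_text()
long_text <- paste(rep("Lorem ipsum dolor sit amet.", 200), collapse = " ")
demo <- data.frame(number = "19/1", txt = long_text, stringsAsFactors = FALSE)

# Number of characters actually stored in each cell, independent of any display
stored_chars <- nchar(demo$txt)
print(stored_chars)

# Write the complete string to the console without data-frame formatting
writeLines(demo$txt[1])

# Optionally write it to a file so it can be compared with the source .xml by hand
out_file <- tempfile(fileext = ".txt")
writeLines(demo$txt[1], out_file)
```

If the character count is close to the length of the text in the source file, the content is present and only the display shortens it; if it is much smaller, the loss happens earlier, during reading or extraction.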

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. It's hard to help if we have no idea what's actually in the XML files. — MrFlick, Nov 29 '22 at 15:58

