I am trying to use the XML package with either the xmlToList or the xmlToDataFrame function. My input data is online (first two lines below), and I only need to work with a certain part of the XML (see the getNodeSet call on the third line).

library(XML)
url <- 'http://ClinicalTrials.gov/show/NCT00191100?resultsxml=true'
xml <- xmlTreeParse(url, useInternalNodes=TRUE)
ns <- getNodeSet(xml, '/clinical_study/clinical_results/reported_events/serious_events/category_list')

It is a list of categories; each category contains “events”, and each event has counts that are specific to a clinical trial arm (e.g., drug vs. placebo).

I only need the events, so the best listing I have so far is this one, for cardio-respiratory arrest, obtained with xmlToList:

xl<-xmlToList(url)
set2<-xl$clinical_results$reported_events$serious_events$category_list
set2[[3]]

> set2[[3]]
$title
[1] "Cardiac disorders"

$event_list
$event_list$event
$event_list$event$sub_title
[1] "Cardio-respiratory arrest"

$event_list$event$counts
         group_id            events subjects_affected  subjects_at_risk 
             "E1"               "1"               "1"             "260" 

$event_list$event$counts
         group_id            events subjects_affected  subjects_at_risk 
             "E2"               "0"               "0"             "255" 

I am not able to use xmlToDataFrame because of the error below. (The node set has all its data in XML attributes, and I think xmlToDataFrame may not handle that.)

hopefulyDF <- getNodeSet(xml, '/clinical_study/clinical_results/reported_events/serious_events/category_list/category/event_list/event/counts')
 xmlToDataFrame(node = hopefulyDF)
Error in matrix(vals, length(nfields), byrow = TRUE) : 
  'data' must be of a vector type, was 'NULL'

What is the best way to extract the counts data? I tried unlist, but I am probably not advanced enough in R. I would like to avoid a loop and manual xmlGetAttr calls, but in the worst case any solution is accepted. I find the XML package very dense, with two representations of XML data: as a list and as node sets... :-(

Ideal output would look like this (for all events, not just set2[[3]]):

event                       group_ID  numerator  denominator
Cardio-respiratory arrest   E1        1          260
Cardio-respiratory arrest   E2        0          255

(Or even with a category column, e.g. "Cardiac disorders"; that would be super-ideal.)

P.S. I tried the answers to the questions "How to transform XML data into a data.frame?" and "R list to data frame", but with no luck. :-(

userJT

1 Answer

You can simplify the XML extraction by iterating over each event and extracting the counts attributes via a relative XPath. By using rbindlist from the data.table package, you can deal with the missing attributes without adding in conditional code:

library(XML)
library(data.table)

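url <- 'http://ClinicalTrials.gov/show/NCT00191100?resultsxml=true'
xml <- xmlTreeParse(url, useInternalNodes=TRUE)

# all <event> nodes anywhere in the document
ns <- getNodeSet(xml, '//event')

# one data.frame per event (one row per <counts> child), stacked with
# fill=TRUE so that attributes missing from some events become NA
rbindlist(lapply(ns, function(x) {
  event <- xmlValue(x)  # text content of the <event> node
  data.frame(event, t(xpathSApply(x, ".//counts", xmlAttrs)))
}), fill=TRUE)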

##                              event group_id subjects_affected events subjects_at_risk
##   1: Total, serious adverse events       E1                44     NA               NA
##   2: Total, serious adverse events       E2                17     NA               NA
##   3:                       Anaemia       E1                 6      6              260
##   4:                       Anaemia       E2                 0      0              255
##   5:           Febrile neutropenia       E1                 6      6              260
##  ---                                                                                 
## 174:                         Cough       E2                15     16              255
## 175:                      Pruritus       E1                14     16              260
## 176:                      Pruritus       E2                 9      9              255
## 177:                  Hypertension       E1                19     19              260
## 178:                  Hypertension       E2                21     21              255

You can always convert it back to a data.frame and/or rename columns if needed.
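
If the category column from the question is also wanted, one option is to start from the counts nodes and walk up to their parents. The following is only a minimal sketch and has not been run against the live ClinicalTrials.gov file. It parses a small inline XML sample, a made-up miniature that assumes the category/event_list/event/counts nesting shown in the question. The names doc_text, get_attr, category_of, event_of and events_df are illustrative. Mapping numerator to subjects_affected and denominator to subjects_at_risk is an assumption about what the question intends. It does rely on xmlGetAttr, but only through vapply, so no explicit loop is needed.

```r
library(XML)

# Hypothetical miniature of the reported_events structure (illustration only)
doc_text <- '<clinical_study><clinical_results><reported_events><serious_events><category_list>
  <category>
    <title>Cardiac disorders</title>
    <event_list>
      <event>
        <sub_title>Cardio-respiratory arrest</sub_title>
        <counts group_id="E1" events="1" subjects_affected="1" subjects_at_risk="260"/>
        <counts group_id="E2" events="0" subjects_affected="0" subjects_at_risk="255"/>
      </event>
    </event_list>
  </category>
  <category>
    <title>Total</title>
    <event_list>
      <event>
        <sub_title>Total, serious adverse events</sub_title>
        <counts group_id="E1" subjects_affected="44"/>
        <counts group_id="E2" subjects_affected="17"/>
      </event>
    </event_list>
  </category>
</category_list></serious_events></reported_events></clinical_results></clinical_study>'

doc <- xmlParse(doc_text, asText = TRUE)

# One node per <counts> element, i.e. one node per output row
counts <- getNodeSet(doc, "//category/event_list/event/counts")

# Attribute lookup that yields NA instead of NULL when the attribute is absent
get_attr <- function(node, name) xmlGetAttr(node, name, default = NA_character_)

# counts -> event -> event_list -> category, then read the <title> child
category_of <- function(n) xmlValue(xmlParent(xmlParent(xmlParent(n)))[["title"]])

# counts -> event, then read the <sub_title> child
event_of <- function(n) xmlValue(xmlParent(n)[["sub_title"]])

events_df <- data.frame(
  category    = vapply(counts, category_of, character(1)),
  event       = vapply(counts, event_of, character(1)),
  group_id    = vapply(counts, get_attr, character(1), name = "group_id"),
  # Assumed mapping: numerator = subjects_affected, denominator = subjects_at_risk
  numerator   = as.integer(vapply(counts, get_attr, character(1), name = "subjects_affected")),
  denominator = as.integer(vapply(counts, get_attr, character(1), name = "subjects_at_risk")),
  stringsAsFactors = FALSE
)

print(events_df)
```

Rows whose counts node lacks an attribute (the "Total" rows in this sample) simply get NA in that column, so no fill step is required with this layout.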

hrbrmstr
  • This is a great approach. For some reason, fill=TRUE gives me an error. Without it, I can see nice data.frames (with a fill problem). My data.table (v1.9.2) does not have a fill parameter defined. I rewrote it with do.call("rbind"), but with no luck. – userJT Oct 08 '14 at 16:39
  • You need data.table >= 1.9.3 – hrbrmstr Oct 08 '14 at 16:41
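
For a data.table version where rbindlist has no fill argument, a plain do.call("rbind") fails because the per-event data frames do not all have the same columns. One possible fallback, sketched here in base R, is to pad each data frame with NA columns before binding. The helper name rbind_fill and the two small input data frames are hypothetical stand-ins for the per-event results; their values are copied from the output shown in the answer.

```r
# Hypothetical helper: row-bind data frames whose columns differ,
# padding the columns a data frame lacks with NA
rbind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
    d[all_cols]  # same column order in every piece
  })
  do.call(rbind, padded)
}

# Stand-ins for two per-event data frames with different attribute sets
totals <- data.frame(
  event = "Total, serious adverse events",
  group_id = c("E1", "E2"),
  subjects_affected = c("44", "17"),
  stringsAsFactors = FALSE
)
anaemia <- data.frame(
  event = "Anaemia",
  group_id = c("E1", "E2"),
  subjects_affected = c("6", "0"),
  events = c("6", "0"),
  subjects_at_risk = c("260", "255"),
  stringsAsFactors = FALSE
)

combined <- rbind_fill(list(totals, anaemia))
print(combined)
```

With the lapply(...) result from the answer, the call would be rbind_fill(lapply(ns, function(x) ...)) in place of rbindlist(..., fill=TRUE). Upgrading data.table remains the simpler route where it is possible.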