0

I usually extract this data into a csv file. I can't do that this time because the text in the notes column is too long and gets truncated if I export to csv. If I export to xml then I can get the whole text I think. But I'm struggling to work with the xml output.

I'm using xml2 with RStudio. I tried xml_children(notes) and a few other things, like as_list(xml_children(notes)).

The XML looks like this:

<RESULTS>
    <ROW>
        <COLUMN NAME="SUBJECT_ID"><![CDATA[12345678]]></COLUMN>
        <COLUMN NAME="PAT_ID"><![CDATA[12345678]]></COLUMN>
        <COLUMN NAME="PAT_MRN_ID"><![CDATA[12345678]]></COLUMN>
        <COLUMN NAME="PAT_ENC_CSN_ID"><![CDATA[222111333]]></COLUMN>
        <COLUMN NAME="CREATE_INSTANT_DTTM"><![CDATA[18-JUL-01]]></COLUMN>
        <COLUMN NAME="NAME"><![CDATA[Progress Notes]]></COLUMN>
        <COLUMN NAME="NOTE_ID"><![CDATA[123456]]></COLUMN>
        <COLUMN NAME="LINE"><![CDATA[1]]></COLUMN>
        <COLUMN NAME="NOTE_TEXT"><![CDATA[ text text text]]></COLUMN>
    </ROW>

I want a data frame that looks like notes$SUBJECT_ID, notes$PAT_ID... notes$NOTE_TEXT

but how?

soporific
  • 103
  • 1
  • 8

1 Answers1

0

I've taken the liberty to adjust your example s.t. the number of entries per column is equal.

library(xml2)
library(tidyverse)

string <- "
    <ROW>
        <COLUMN NAME=\"SUBJECT_ID\"><![CDATA[12 345 678]]></COLUMN>
        <COLUMN NAME=\"PAT_ID\"><![CDATA[123 456 78]]></COLUMN>
        <COLUMN NAME=\"PAT_MRN_ID\"><![CDATA[12 345 678]]></COLUMN>
        <COLUMN NAME=\"PAT_ENC_CSN_ID\"><![CDATA[222 1113 33]]></COLUMN>
        <COLUMN NAME=\"CREATE_INSTANT_DTTM\"><![CDATA[18-JUL-01 19-JUL-02 19-JUL-02]]></COLUMN>
        <COLUMN NAME=\"NAME\"><![CDATA[Progress Notes NN]]></COLUMN>
        <COLUMN NAME=\"NOTE_ID\"><![CDATA[12 345 6]]></COLUMN>
        <COLUMN NAME=\"LINE\"><![CDATA[1 1 2]]></COLUMN>
        <COLUMN NAME=\"NOTE_TEXT\"><![CDATA[text text text]]></COLUMN>
    </ROW>"


xml <- read_xml(string)
xml %>% 
  xml_contents() %>% 
  xml_text() %>% 
  map_dfc(~ unlist(str_split(.x, " "))) %>%
  magrittr::set_colnames(
    xml %>% 
      xml_contents() %>% 
      xml_attrs() %>%
      flatten_chr()
  )

# A tibble: 3 x 9
  SUBJECT_ID PAT_ID PAT_MRN_ID PAT_ENC_CSN_ID CREATE_INSTANT_DTTM NAME     NOTE_ID LINE  NOTE_TEXT
  <chr>      <chr>  <chr>      <chr>          <chr>               <chr>    <chr>   <chr> <chr>    
1 12         123    12         222            18-JUL-01           Progress 12      1     text     
2 345        456    345        1113           19-JUL-02           Notes    345     1     text     
3 678        78     678        33             19-JUL-02           NN       6       2     text 
erocoar
  • 5,723
  • 3
  • 23
  • 45
  • The way the elements are listed really is the way they appear. For example Progress Notes is just one entry for NAME (and has no quotes around it) and the NOTE_TEXT is just a huge amount of text (no quotes around it). It looks like you also put in an extra slash mark in the XML? xml_contents() gets an error message for me : ```Error in fun(x$node, ...) : external pointer is not valid ```....maybe because of the XML? – soporific Apr 25 '19 at 17:41
  • I didn't add quotes around it, I just added spaces to simulate more than one data entry per column. Or do you have no whitespace separation? What version of xml2 and R are you running? – erocoar Apr 25 '19 at 17:53