Sorry, edited with one more little nuance! I had simplified my raw file a little too much in the example I provided, so while your solution works beautifully as-is, what if there are a few extra things thrown into the second line? Those seem to throw off the xml_find_all(page, "//event"), since now it can't find that node. How can I get the script to ignore the extras (or maybe what is the right search term to incorporate them?) Thanks!!!

I'm new to working with xml, and I have some speech xml files that I'm trying to flatten into dataframes in R, but I can't get them to be read using some of the standard functions in the XML package. I think the problem is the plist format, because some of the other answers that I've tried to apply don't work on these files.

My files look as follows (*****second line edited):

<?xml version="1.0" encoding="us-ascii"?>
<event id="111" extraInfo="CivilwarSpeeches" xmlns = "someurl">
    <meta>
            <title>Gettysburg</title>
            <date>1863-11-19</date>
            <organizations>
                    <org>Union</org>
            </organizations>
            <people>
                    <person id="0" type="President">Honest Abe</person>
            </people>
    </meta>
    <body>
            <section name="Address">
                    <speaker id="0">
                            <plist>
                                    <p>Four score and seven years ago</p>
                            </plist>
                    </speaker>
            </section>
    </body>
</event>

And I would like to end up with a dataframe that links some of the info in the two sections, something like

Section | Speaker | Speaker Type | Speaker Name | Body

Address | 0 | President | Honest Abe | Four score and seven years ago

I found the answer to "Parsing XML file with known structure and repeating elements" fairly helpful, but I still can't get it to unpack my data.

Any help would be appreciated!

icedredtea

1 Answer

I prefer the xml2 package over the XML package.
This is a pretty straightforward problem: read the data in, parse out the desired attributes and nodes, and assemble them into a data frame.

library(xml2)
page<-read_xml('<?xml version="1.0" encoding="us-ascii"?>
<event id="111">
         <meta>
         <title>Gettysburg</title>
         <date>1863-11-19</date>
         <organizations>
         <org>Union</org>
         </organizations>
         <people>
         <person id="0" type="President">Honest Abe</person>
         </people>
         </meta>
         <body>
         <section name="Address">
         <speaker id="0">
         <plist>
         <p>Four score and seven years ago</p>
         </plist>     </speaker>     </section>     </body> </event>')


#get the nodes
nodes<-xml_find_all(page, "//event")

#parse the requested information out of each node
Section<- xml_attr(xml_find_first(nodes, ".//section"), "name")
Speaker<- xml_attr(xml_find_first(nodes, ".//person"), "id")
SpeakerType<- xml_attr(xml_find_first(nodes, ".//person"), "type")
SpeakerName<- xml_text(xml_find_first(nodes, ".//person")) 
Body<- xml_text(xml_find_first(nodes, ".//plist/p"))  

#put together into a data.frame
answer<-data.frame(Section, Speaker, SpeakerType, SpeakerName, Body)

The code is set up to parse a series of "event" nodes. For clarity, I extract each requested field in its own step (five in total) and then combine them into the final data frame.
Part of the justification is alignment: xml_find_first() returns exactly one result per "event" node, so the columns stay the same length even if some "event" nodes are missing part of the requested information. This could be simplified, but if your dataset is small, there shouldn't be much of a performance impact.
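
To illustrate the alignment point, here is a minimal sketch with two made-up "event" nodes, the second of which has no "person" node. The `<events>` wrapper, the second event, and its text are assumptions added only for this illustration; they are not from the question's files.

```r
library(xml2)

# Hypothetical input: two events, the second one lacks a <person> node.
doc <- read_xml('<events>
  <event id="1">
    <meta><people><person id="0" type="President">Honest Abe</person></people></meta>
    <body><section name="Address"><speaker id="0">
      <plist><p>Four score and seven years ago</p></plist>
    </speaker></section></body>
  </event>
  <event id="2">
    <body><section name="Reply"><speaker id="1">
      <plist><p>Placeholder text</p></plist>
    </speaker></section></body>
  </event>
</events>')

nodes <- xml_find_all(doc, "//event")

# xml_find_first() returns one result per event, so every vector has length 2.
# Where a node is missing, xml_attr() and xml_text() give NA instead of
# silently dropping the entry.
Section     <- xml_attr(xml_find_first(nodes, ".//section"), "name")
SpeakerName <- xml_text(xml_find_first(nodes, ".//person"))
Body        <- xml_text(xml_find_first(nodes, ".//plist/p"))

aligned <- data.frame(Section, SpeakerName, Body, stringsAsFactors = FALSE)
print(aligned)
```

Had the fields been collected with xml_find_all(doc, "//person") instead, the SpeakerName vector would have only one element and data.frame() would no longer line it up with the correct event.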

Dave2e
  • This works great, but my bad, I guess I oversimplified my example. I made the edit in my original question, the problem seems to be the extra bits of info in the event part of the xml, so nodes<-xml_find_all(page, "//event") gets tripped up. Do you know how to stop that from happening? – icedredtea Mar 26 '18 at 17:57
  • Yes, it looks like a namespace is defined. The way I typically deal with this is to strip out the namespaces with the `xml_ns_strip()` function, which is also in the `xml2` package. – Dave2e Mar 26 '18 at 18:15
  • Ahh, ok, that's what those are. Love the xml2 package! – icedredtea Mar 26 '18 at 18:25
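
For reference, a minimal sketch of the namespace fix discussed in the comments above, applied to a shortened copy of the edited example. The namespace URI "http://example.com/ns" is a placeholder assumption standing in for the question's "someurl". Two options are shown: querying with the "d1" prefix that xml_ns() assigns to the default namespace, or stripping the namespace with xml_ns_strip() so the original XPath works unchanged.

```r
library(xml2)

# "http://example.com/ns" is a placeholder, not the asker's real namespace URI.
page <- read_xml('<event id="111" extraInfo="CivilwarSpeeches" xmlns="http://example.com/ns">
  <meta>
    <people><person id="0" type="President">Honest Abe</person></people>
  </meta>
  <body>
    <section name="Address"><speaker id="0">
      <plist><p>Four score and seven years ago</p></plist>
    </speaker></section>
  </body>
</event>')

# With a default namespace declared, the unprefixed XPath matches nothing.
n_before <- length(xml_find_all(page, "//event"))

# Option 1: keep the namespace and query through the "d1" prefix from xml_ns().
ns <- xml_ns(page)
with_prefix <- xml_find_all(page, "//d1:event", ns)

# Option 2: strip the default namespace (this modifies the document in place),
# after which the code from the answer runs as written.
xml_ns_strip(page)
nodes <- xml_find_all(page, "//event")

Section     <- xml_attr(xml_find_first(nodes, ".//section"), "name")
SpeakerName <- xml_text(xml_find_first(nodes, ".//person"))
Body        <- xml_text(xml_find_first(nodes, ".//plist/p"))
answer <- data.frame(Section, SpeakerName, Body, stringsAsFactors = FALSE)
```

Extra attributes such as extraInfo are not the cause of the failed match; only the xmlns declaration changes how "//event" is resolved.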