2

I am trying to parse some XML documents in R into a data frame. What I want to do is flatten the XML tree so that I get one row in the data frame per child node (here, one row per frame). I also want each row to contain the data from its parent (the event).

Example:

<xml>
    <eventlist>
        <event>
            <ProcessIndex>1063</ProcessIndex>
            <Time_of_Day>2:54:20.2959537 PM</Time_of_Day>
            <Process_Name>chrome.exe</Process_Name>
            <PID>12164</PID>
            <Operation>ReadFile</Operation>
            <Result>SUCCESS</Result>
            <Detail>Offset: 1,684,224, Length: 256</Detail>
            <stack>
                <frame>
                    <depth>0</depth>
                    <address>0xfffff8038683667c</address>
                    <path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
                    <location>FltDecodeParameters + 0x1a6c</location>
                </frame>
                <frame>
                    <depth>1</depth>
                    <address>0xfffff80386834e13</address>
                    <path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
                    <location>FltDecodeParameters + 0x203</location>
                </frame>
                <frame>
                    <depth>3</depth>
                    <address>0x7ffea54ffac1</address>
                    <path>C:\WINDOWS\SYSTEM32\ntdll.dll</path>
                    <location>RtlUserThreadStart + 0x21</location>
                </frame>
            </stack>
        </event>
        <event>
            <ProcessIndex>1063</ProcessIndex>
            <Time_of_Day>2:54:20.2960270 PM</Time_of_Day>
            <Process_Name>chrome.exe</Process_Name>
            <PID>12164</PID>
            <Operation>WriteFile</Operation>
            <Result>SUCCESS</Result>
            <Detail>Offset: 103,016, Length: 36</Detail>
            <stack>
                <frame>
                    <depth>0</depth>
                    <address>0xfffff8038683667c</address>
                    <path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
                    <location>FltDecodeParameters + 0x1a6c</location>
                </frame>
                <frame>
                    <depth>1</depth>
                    <address>0xfffff80386834e13</address>
                    <path>C:\WINDOWS\System32\drivers\FLTMGR.SYS</path>
                    <location>FltDecodeParameters + 0x203</location>
                </frame>
                <frame>
                    <depth>26</depth>
                    <address>0x7ffea54ffac1</address>
                    <path>C:\WINDOWS\SYSTEM32\ntdll.dll</path>
                    <location>RtlUserThreadStart + 0x21</location>
                </frame>
            </stack>
        </event>
    </eventlist>
</xml>

And the result that I would like to get is

ProcessIndex  Time_of_Day  Process_Name  PID    Operation  Result   depth  address    path                         location
1063          2:54:20      chrome.exe    12164  ReadFile   SUCCESS  0      0xfffff..  C:\WINDOWS\System32\driv...  FltDecodeParameters + 0x1a6c
1063          2:54:20      chrome.exe    12164  ReadFile   SUCCESS  1      0xfffff..  C:\WINDOWS\System32\driv...  FltDecodeParameters + 0x203
1063          2:54:20      chrome.exe    12164  ReadFile   SUCCESS  2      0xfffff..  C:\WINDOWS\System32\driv...  RtlUserThreadStart + 0x21
1063          2:54:20      chrome.exe    12164  WriteFile  SUCCESS  0      0xfffff..  C:\WINDOWS\System32\driv...  FltDecodeParameters + 0x1a6c
1063          2:54:20      chrome.exe    12164  WriteFile  SUCCESS  1      0xfffff..  C:\WINDOWS\System32\driv...  FltDecodeParameters + 0x203
1063          2:54:20      chrome.exe    12164  WriteFile  SUCCESS  2      0xfffff..  C:\WINDOWS\System32\driv...  RtlUserThreadStart + 0x21

I tried using the XML package and xmlToDataFrame:

xmldf_events_stack <- xmlToDataFrame(nodes=getNodeSet(data_xml_2,"//eventlist/event/stack/frame"))

but that only gives me the flattened frames without the parent data. Also, if I parse the event nodes into a data frame, all XML tags are stripped from the stack field, so there is no way for me to parse it later.

Any help or guidance in the right direction would be appreciated.

MrTommek

2 Answers

4

I solved the problem. I am sure there is a more elegant way to do this, but this is what I did. I hope it helps somebody in the future.

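library(XML)   # xpathSApply, xmlName, xmlValue
library(plyr)  # rbind.fill

# data_xml_2 is the parsed XML document (e.g. the result of xmlParse())
df <- do.call(rbind.fill, lapply(data_xml_2['//eventlist/event'], function(x) {
  # Tag name of the element that owns each text node
  names <- xpathSApply(x, './/.', xmlName)
  names <- names[which(names == "text") - 1]
  values <- xpathSApply(x, ".//text()", xmlValue)
  # Values 1-7 are the event fields; the rest are frame fields, 4 per frame
  framevalues <- values[8:length(values)]
  framevalues <- matrix(framevalues, ncol = 4, byrow = TRUE)

  # Prepend the 7 event fields to every frame row
  retvalues <- framevalues
  for (i in 7:1) {
    retvalues <- cbind(values[i], retvalues)
  }
  # 7 event columns + 4 frame columns = 11 column names
  colnames(retvalues) <- names[1:11]
  return(as.data.frame(retvalues))
}))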
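
A possible way to drop the hard-coded positions (`8:length(values)`, `7:1`) is sketched below. It is only a sketch under assumptions: `flatten_event` and its `child_xpath` argument are hypothetical names, the inline sample is a cut-down version of the XML in the question, and the parent fields are assumed to be exactly the leaf elements directly under each event. It uses only `getNodeSet`, `xmlName`, `xmlValue`, `xmlToDataFrame` and `xmlParse` from the XML package.

```r
library(XML)

# Hypothetical helper: one row per child record, with the parent's leaf
# fields repeated on every row. child_xpath is relative to the parent node.
flatten_event <- function(event, child_xpath = "./stack/frame") {
  # Direct children that have no element children are treated as parent fields
  leaves <- getNodeSet(event, "./*[not(*)]")
  parent <- setNames(sapply(leaves, xmlValue), sapply(leaves, xmlName))
  parent_df <- as.data.frame(as.list(parent), stringsAsFactors = FALSE)

  frames <- getNodeSet(event, child_xpath)
  if (length(frames) == 0) return(NULL)  # event without child records
  children <- xmlToDataFrame(nodes = frames, stringsAsFactors = FALSE)

  # Repeat the single parent row once per child row, then bind column-wise
  out <- cbind(parent_df[rep(1, nrow(children)), , drop = FALSE], children)
  rownames(out) <- NULL
  out
}

# Cut-down sample data, for illustration only
xml_text <- "<eventlist>
  <event>
    <PID>12164</PID><Operation>ReadFile</Operation>
    <stack>
      <frame><depth>0</depth><location>FltDecodeParameters + 0x1a6c</location></frame>
      <frame><depth>1</depth><location>FltDecodeParameters + 0x203</location></frame>
    </stack>
  </event>
  <event>
    <PID>12164</PID><Operation>WriteFile</Operation>
    <stack>
      <frame><depth>0</depth><location>FltDecodeParameters + 0x1a6c</location></frame>
    </stack>
  </event>
</eventlist>"

doc <- xmlParse(xml_text, asText = TRUE)
events <- getNodeSet(doc, "//eventlist/event")
flat <- do.call(rbind, lapply(events, flatten_event))
```

The node set is fetched once and each event is then queried with a relative XPath, so no field counts need to be known in advance. Plain `rbind` assumes every event yields the same columns; if that does not hold, `rbind.fill` from plyr could be swapped in.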
MrTommek
  • This is nice! Any way to make this work with an arbitrary XML file? I noticed the `8:length(values)` and `7:1` – tcash21 Mar 27 '18 at 19:51
0

Consider selecting each event by its node index, [#], and then merging the parent with its children inside an lapply call. This yields a list of data frames that can be row-bound together:

library(XML)

doc <- xmlParse("/path/to/XML/file.xml")

xml_len <- length(getNodeSet(doc,"//eventlist/event"))

dflist <- lapply(seq(xml_len), function(i){   
  # PARENT NODES   
  d1 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//eventlist/event[",i,"]"))), key=1)
  # CHILD NODES
  d2 <- transform(xmlToDataFrame(nodes=getNodeSet(doc, paste0("//eventlist/event[",i,"]/stack/frame"))), key=1) 

  # MERGE ON KEY, THEN DROP KEY
  merge(d1, d2, by="key")[-1]      
})

xmldf_events_stack <- do.call(rbind, dflist)
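
The merge on a constant key acts as a cross join: every parent row is paired with every child row. A minimal base R sketch of just that step, using made-up stand-in data frames (`parent` and `children` are illustrative names, not objects from the code above):

```r
# Stand-ins for d1 (one event row) and d2 (its frame rows); values are illustrative
parent   <- data.frame(PID = "12164", Operation = "ReadFile", key = 1,
                       stringsAsFactors = FALSE)
children <- data.frame(depth = c("0", "1"), key = 1,
                       stringsAsFactors = FALSE)

# Every row shares key = 1, so merge pairs the parent row with each child row;
# merge puts the key column first, so [-1] drops it again
merged <- merge(parent, children, by = "key")[-1]
```

Here `merged` has two rows with the columns PID, Operation and depth. In the full code, d1 likely also carries a `stack` column holding the concatenated text of all frames, since the event node has `stack` as a child; it can be dropped afterwards if it is not needed.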
Parfait
  • Thanks for your answer. It works, but performance-wise it is really slow. For the same file (~450 MB), my solution takes around 40 seconds to run and yours takes around 5 minutes. – MrTommek Jun 20 '17 at 16:49
  • Wow! Over a month later. I could have helped if you had reached out earlier and I had known the size of the file. Glad you arrived at a solution! Happy coding! – Parfait Jun 20 '17 at 19:20