I am having trouble parsing an XML file into a data frame in R.

Here is the XML:

<?xml version="1.0" encoding="windows-1251"?>
<dlc ac="ED29099541DB7B022D00E4179F00" softversion="0.2">
  <statistics enterprise="Организация">
  <shop Id="4" GUID="{F5D518E4-3C80-44E9-835B-D87CC35A7BDB}" 
worktimefrom="2015-04-03 08:00:00" worktimeto="2015-04-03 20:00:00" 
name="Объект" clientId="Client 1">
  <sensor GUID="{63017726-D121-4EB3-A684-BC3D27AED119}"
 GCGUID="00000000-0000-0000-0000-000000000000" Id="25" type="1" minortype="1" address="01"
 name="Устройство" balance="0" devtype="1">
    <stat datetime="2017-01-20 09:37:00" realin="1" realout="2" />
    <stat datetime="2017-01-20 09:38:00" realin="1" realout="2" />
    <stat datetime="2017-01-20 09:39:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:40:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 09:41:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:42:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:43:00" realin="1" realout="1" />
    <stat datetime="2017-01-20 09:44:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 09:52:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:53:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 09:56:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:57:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 10:08:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 10:16:00" realin="0" realout="1" />
  </sensor>
</shop>

How can I parse it into a data frame in R?

rjdkolb

1 Answer

It's unclear exactly what you want in the data frame, but here is my solution:

First, the data:

file <- '
<?xml version="1.0" encoding="windows-1251"?>
 <dlc ac="ED29099541DB7B022D00E4179F00" softversion="0.2">
<statistics enterprise="Организация">
<shop Id="4" GUID="{F5D518E4-3C80-44E9-835B-D87CC35A7BDB}" 
worktimefrom="2015-04-03 08:00:00" worktimeto="2015-04-03 20:00:00" 
name="Объект" clientId="Client 1">
  <sensor GUID="{63017726-D121-4EB3-A684-BC3D27AED119}"
  GCGUID="00000000-0000-0000-0000-000000000000" Id="25" type="1" minortype="1" address="01"
 name="Устройство" balance="0" devtype="1">
    <stat datetime="2017-01-20 09:37:00" realin="1" realout="2" />
    <stat datetime="2017-01-20 09:38:00" realin="1" realout="2" />
    <stat datetime="2017-01-20 09:39:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:40:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 09:41:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:42:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:43:00" realin="1" realout="1" />
    <stat datetime="2017-01-20 09:44:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 09:52:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:53:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 09:56:00" realin="1" realout="0" />
    <stat datetime="2017-01-20 09:57:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 10:08:00" realin="0" realout="1" />
    <stat datetime="2017-01-20 10:16:00" realin="0" realout="1" />
  </sensor>
</shop>'

Now we use rvest to extract the attributes of each stat node and put them in a data frame:

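library(rvest)

# Parse the string once and select every <stat> node
lines <- read_html(file) %>% html_nodes('stat')

# Pull each attribute out as a character vector
time <- lines %>% html_attr('datetime')
realin <- lines %>% html_attr('realin')
realout <- lines %>% html_attr('realout')

# Combine the vectors into a data frame, one row per <stat> node
df <- data.frame(time, realin, realout, stringsAsFactors = FALSE)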

The result is:

> df

##                   time realin realout
## 1  2017-01-20 09:37:00      1       2
## 2  2017-01-20 09:38:00      1       2
## 3  2017-01-20 09:39:00      1       0
## 4  2017-01-20 09:40:00      0       1
## 5  2017-01-20 09:41:00      1       0
## 6  2017-01-20 09:42:00      1       0
## 7  2017-01-20 09:43:00      1       1
## 8  2017-01-20 09:44:00      0       1
## 9  2017-01-20 09:52:00      1       0
## 10 2017-01-20 09:53:00      0       1
## 11 2017-01-20 09:56:00      1       0
## 12 2017-01-20 09:57:00      0       1
## 13 2017-01-20 10:08:00      0       1
## 14 2017-01-20 10:16:00      0       1
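
Note that `html_attr()` returns character vectors, so every column in `df` is text. If you want to do arithmetic or work with the timestamps, a conversion step along these lines may help. This is only a sketch: the two-row `df` is copied by hand from the result above so the snippet runs on its own, and the `"UTC"` time zone is my assumption, since the XML does not state one.

```r
# Two rows copied from the result above; in practice `df` comes from the rvest code.
df <- data.frame(
  time    = c("2017-01-20 09:37:00", "2017-01-20 09:38:00"),
  realin  = c("1", "1"),
  realout = c("2", "2"),
  stringsAsFactors = FALSE
)

# Convert the text columns explicitly.
# The time zone is an assumption; adjust it to wherever the sensor is located.
df$time    <- as.POSIXct(df$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
df$realin  <- as.integer(df$realin)
df$realout <- as.integer(df$realout)

# Totals are now possible because the counts are integers.
total_in  <- sum(df$realin)
total_out <- sum(df$realout)
```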
Oriol Mirosa
  • Thanks, but do you know how to extract the data after I have created a variable with `data <- xmlParse("{00A9CC27-AA7B-4BB4-8745-804B5453382F}.xml")`? – Panchenko Andrey Aug 11 '17 at 18:04
  • I'm sorry, I don't understand what you mean in your comment. Can you clarify? – Oriol Mirosa Aug 11 '17 at 18:05
  • I have many XML files and need to convert all of them into data frames, but none of the solutions work. And I don't want to convert them to text before building the data frame. – Panchenko Andrey Aug 11 '17 at 18:09
  • I don't see why my solution wouldn't work. You don't need to convert anything to text. I used the `file` object above because that's what you provided. If you have the names of the files that you need to parse, you can just read them with rvest directly. For instance: `file <- read_html('name_of_your_file.xml')` and then continue as I did. If you have many files, you could loop over them and merge them. It's hard to be specific without knowing the structure and location of all your files, but the core of the solution should be what I wrote above. – Oriol Mirosa Aug 11 '17 at 18:25
  • I used this and it doesn't work: https://www.tutorialspoint.com/r/r_xml_files.htm – Panchenko Andrey Aug 11 '17 at 18:26
  • The solution I wrote works with the data you provided. If you need something different, you should be more specific and/or provide different data. – Oriol Mirosa Aug 11 '17 at 18:31
  • `file <- read_html("{00A9CC27-AA7B-4BB4-8745-804B5453382F}.xml")` followed by `lines <- read_html(file) %>% html_nodes('stat')` gives: `Error in UseMethod("read_xml") : no applicable method for 'read_xml' applied to an object of class "c('xml_document', 'xml_node')"` – Panchenko Andrey Aug 11 '17 at 18:34
  • In what you wrote in the comments you are using `read_html()` twice, and that's why you see the error. Notice that in my code there is only one instance of `read_html()`. – Oriol Mirosa Aug 11 '17 at 18:44
  • It worked, thank you. One last question: how do I read the shop ID, as in `shop Id="4"`? – Panchenko Andrey Aug 11 '17 at 19:24
  • If what you want is to get the '4' and there is only one 'shop ID' in each file, then you can do this: `shopID <- read_html(file) %>% html_node('shop') %>% html_attr('id')`. If, instead, what you're looking for is the long GUID, then you can do this: `shopID <- read_html(file) %>% html_node('shop') %>% html_attr('guid')`. – Oriol Mirosa Aug 11 '17 at 19:32
  • Notice the pattern here: you read the file with `read_html()`, and then you extract the nodes with `html_node()` (specifying the node between the brackets), and the attributes with `html_attr()` (again, specifying the attribute of those nodes). Hope this makes sense. – Oriol Mirosa Aug 11 '17 at 19:33
  • Thank you so much. – Panchenko Andrey Aug 11 '17 at 19:37
  • No problem! Please mark the answer as correct so that others can find it. – Oriol Mirosa Aug 11 '17 at 19:42
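
Putting the comments together, a loop over many files could look roughly like the sketch below. Everything named here is hypothetical: the `parse_stats()` helper, the column names, and the two tiny files written to a temporary directory so the example runs on its own (your real files will have more attributes and rows). The key points from the thread are that `read_html()` is called exactly once per file, and that attribute names are written in lowercase, matching the `'id'` used in the comment above.

```r
library(rvest)

# Hypothetical helper: read one file and return its <stat> rows as a data frame.
parse_stats <- function(path) {
  doc   <- read_html(path)            # read the file exactly once
  stats <- html_nodes(doc, "stat")
  n     <- length(stats)
  data.frame(
    file    = rep(basename(path), n),
    shop_id = rep(html_attr(html_node(doc, "shop"), "id"), n),
    time    = html_attr(stats, "datetime"),
    realin  = as.integer(html_attr(stats, "realin")),
    realout = as.integer(html_attr(stats, "realout")),
    stringsAsFactors = FALSE
  )
}

# Made-up sample files so the sketch is self-contained.
xml_dir <- tempfile("xml_")
dir.create(xml_dir)
writeLines(
  '<shop Id="4"><sensor Id="25">
     <stat datetime="2017-01-20 09:37:00" realin="1" realout="2" />
   </sensor></shop>',
  file.path(xml_dir, "a.xml")
)
writeLines(
  '<shop Id="7"><sensor Id="31">
     <stat datetime="2017-01-20 09:39:00" realin="1" realout="0" />
     <stat datetime="2017-01-20 09:40:00" realin="0" realout="1" />
   </sensor></shop>',
  file.path(xml_dir, "b.xml")
)

# Parse every .xml file in the directory and stack the results.
paths     <- list.files(xml_dir, pattern = "\\.xml$", full.names = TRUE)
all_stats <- do.call(rbind, lapply(paths, parse_stats))
```

For your own data, point `list.files()` at the directory that holds your XML files instead of `xml_dir`. A file without any stat nodes simply contributes zero rows.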