Convert XML to a data frame

Question

I am looking to transfer data from a .php link into an R data frame, but am having trouble doing so.

Attempts thus far:

require(XML)
data <- xmlParse("http://www.mahdial-husseini.com/xmlthing.php ")
xml_data <- xmlToList(data)

The error I am getting: Error: 1: failed to load HTTP resource

Additionally (and more conceptually), I don't quite understand the nature of the link. Is this XML data in a php file, and if so, when using R to gather data, do I treat it as XML or PHP? Thank you

What have you tried that didn't generate the output you expected or caused errors that were confusing. There are literally tens of dozens of posts on SO about converting XML to data frames. Which ones did you review? What processes in them did not work? This is not a code-writing service and your question is likely to get closed w/o more information and signs you did some work or have familiarity with R. PHP also has nothing to do with this question. — hrbrmstr, Nov 16 '17 at 22:43
Thanks for the fair reply. I am new to PHP, and was under the impression that the attached link was somehow unique to PHP given the file extension, despite the data itself being XML. Also, my efforts with posts in XML form have not worked out; they are being added to the original post — csapidus, Nov 16 '17 at 23:00

BenoitLondon · Answer 1 · 2017-11-17T09:42:51.633

you can use rvest package (and data.table for convenience)

library(data.table)
library(rvest)
a <- read_html("http://www.mahdial-husseini.com/xmlthing.php")
dt <- rbindlist(lapply(a %>% html_nodes(css = "body > ppm1_0 > ppm1_0") %>% 
                             xml_attrs(), 
                       function(x) as.data.table(t((x)))))
dt <- cbind(dt[,2, with = FALSE], 
            as.data.table(lapply(dt[,-2, with = FALSE], as.numeric)))
dt

          site sample month day year hour      jd doy pm25_hourly aod_47
1:      duluth      0     0   0    0    0       0   0           0  0.000
2:                 19    12   0 2004    5       0   0          30  0.000
3:    Sarasota   4545     4   0 2017    0       0   0           0  0.000
4:     Atlanta  11111    10   1 2004   13 2453280 275          23  0.379
5:  Birmingham  11112    10   2 2008   14 2453281 276           0  0.000
6:    Savannah  11113    10   3 2004   13 2453282 277          15  0.181
7:   Fort Knox  11114     6  20 2017   21       0 301          18  0.000
8: Fort Rucker  63738     1   0 2015    0       0   0          40  0.000
   omi_aot omi_no2      fit res      lng     lat rel_humid altitude pressure
1:   0.000   0.000  0.00000   0  84.1000 34.0000         0        0        0
2:   0.000   0.000  0.00000   0  63.6167 38.4161         0        0        0
3:   0.000   0.000  0.00000   0 -82.5300 27.3300         0        0        0
4:   0.148   0.274 16.01850  NA -84.7000 33.7500         0        0        0
5:   0.000   0.000 19.19440   0 -86.8000 33.5200         0        0        0
6:   0.133   0.127  9.00433  NA -81.1000 32.0800         0        0        0
7:   0.000   0.000  0.00000   0 -85.9500 37.9100         0        0        0
8:   0.000   0.000  0.00000   0 -85.7000 31.3400         0        0        0
   signal_received temp_c
1:               0      0
2:               0      0
3:               0      0
4:               0      0
5:               0      0
6:               0      0
7:               0      0
8:               0      0

score 3 · Accepted Answer · answered Nov 17 '17 at 00:29

Or, possibly something readable:

library(xml2)
library(tidyverse)

This will help make better column names:

mcga <- function(tbl) {
  x <- colnames(tbl)
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  x <- make.unique(x, sep = "_")
  colnames(tbl) <- x
  tbl
}

This gets figured out automagically but it's nice to define it after it figures it out since it help with data consistency:

cols(
  .default = col_integer(),
  site = col_character(),
  aod_47 = col_double(),
  omi_aot = col_double(),
  omi_no2 = col_double(),
  fit = col_double(),
  lng = col_double(),
  lat = col_double()
) -> xdf_cols

Now the work:

doc <- read_xml("http://www.mahdial-husseini.com/xmlthing.php")

xml_find_all(doc, ".//PPM1_0") %>% 
  map_df(~{
    xml_attrs(.x) %>% 
      as.list()
  }) %>% 
  mcga() %>% 
  type_convert(col_types = xdf_cols) -> xdf

The type_convert() isn't fully necessary but it — with the column definitions — make for consistency in results.

And, the results:

xdf
## # A tibble: 8 x 21
##   sample        site month   day  year  hour      jd   doy pm25_hourly aod_47 omi_aot omi_no2      fit   res
##    <int>       <chr> <int> <int> <int> <int>   <int> <int>       <int>  <dbl>   <dbl>   <dbl>    <dbl> <int>
## 1      0      duluth     0     0     0     0       0     0           0  0.000   0.000   0.000  0.00000     0
## 2     19        <NA>    12     0  2004     5       0     0          30  0.000   0.000   0.000  0.00000     0
## 3   4545    Sarasota     4     0  2017     0       0     0           0  0.000   0.000   0.000  0.00000     0
## 4  11111     Atlanta    10     1  2004    13 2453280   275          23  0.379   0.148   0.274 16.01850    NA
## 5  11112  Birmingham    10     2  2008    14 2453281   276           0  0.000   0.000   0.000 19.19440     0
## 6  11113    Savannah    10     3  2004    13 2453282   277          15  0.181   0.133   0.127  9.00433    NA
## 7  11114   Fort Knox     6    20  2017    21       0   301          18  0.000   0.000   0.000  0.00000     0
## 8  63738 Fort Rucker     1     0  2015     0       0     0          40  0.000   0.000   0.000  0.00000     0
## # ... with 7 more variables: lng <dbl>, lat <dbl>, rel_humid <int>, altitude <int>, pressure <int>,
## #   signal_received <int>, temp_c <int>

Full structure:

glimpse(xdf)
## Observations: 8
## Variables: 21
## $ sample          <int> 0, 19, 4545, 11111, 11112, 11113, 11114, 63738
## $ site            <chr> "duluth", NA, "Sarasota", "Atlanta", "Birmingham", "Savan...
## $ month           <int> 0, 12, 4, 10, 10, 10, 6, 1
## $ day             <int> 0, 0, 0, 1, 2, 3, 20, 0
## $ year            <int> 0, 2004, 2017, 2004, 2008, 2004, 2017, 2015
## $ hour            <int> 0, 5, 0, 13, 14, 13, 21, 0
## $ jd              <int> 0, 0, 0, 2453280, 2453281, 2453282, 0, 0
## $ doy             <int> 0, 0, 0, 275, 276, 277, 301, 0
## $ pm25_hourly     <int> 0, 30, 0, 23, 0, 15, 18, 40
## $ aod_47          <dbl> 0.000, 0.000, 0.000, 0.379, 0.000, 0.181, 0.000, 0.000
## $ omi_aot         <dbl> 0.000, 0.000, 0.000, 0.148, 0.000, 0.133, 0.000, 0.000
## $ omi_no2         <dbl> 0.000, 0.000, 0.000, 0.274, 0.000, 0.127, 0.000, 0.000
## $ fit             <dbl> 0.00000, 0.00000, 0.00000, 16.01850, 19.19440, 9.00433, 0...
## $ res             <int> 0, 0, 0, NA, 0, NA, 0, 0
## $ lng             <dbl> 84.1000, 63.6167, -82.5300, -84.7000, -86.8000, -81.1000,...
## $ lat             <dbl> 34.0000, 38.4161, 27.3300, 33.7500, 33.5200, 32.0800, 37....
## $ rel_humid       <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ altitude        <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ pressure        <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ signal_received <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ temp_c          <int> 0, 0, 0, 0, 0, 0, 0, 0

Convert XML to a data frame

2 Answers2

Linked