
I want to grab the content at the URL below, where the original data comes in simple columns and rows. I tried readHTMLTable and it does not work here. Using web scraping with XPath, how can I get clean data without the '\n...' characters and keep it in a data.frame? Is this possible without saving to CSV first? Kindly help me improve my code. Thank you

library(rvest)
library(dplyr)
page <- read_html("http://weather.uwyo.edu/cgi-bin/sounding?region=seasia&TYPE=TEXT%3ALIST&YEAR=2006&MONTH=09&FROM=0100&TO=0100&STNM=48657")

xpath <- '/html/body/pre[1]'
txt <- page %>% html_node(xpath=xpath) %>% html_text()
txt

[1] "\n-----------------------------------------------------------------------------\n   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV\n    hPa     m      C      C      %    g/kg    deg   knot     K      K      K \n-----------------------------------------------------------------------------\n 1009.0     16   23.8   22.7     94  17.56    170      2  296.2  346.9  299.3\n 1002.0     78   24.6   21.6     83  16.51    252      4  297.6  345.6  300.5\n 1000.0     96   24.4   21.3     83  16.23    275      4  297.6  344.8  300.4\n  962.0    434   22.9   20.0     84  15.56    235     10  299.4  345.0  302.1\n  925.0    777   21.4   18.7     85  14.90    245     11  301.2  345.2  303.9\n  887.0   1142   20.3   16.0     76  13.04    255     15  303.7  342.7  306.1\n  850.0   1512   19.2   13.2     68  11.34    230     17  306.2  340.6  308.3\n  839.0   1624   18.8   11.8     64  10.47    225     17  307.0  338.8  308.9\n  828.0   1735   18.0   11.4     65  10.33   ... <truncated>
Siti Sal

2 Answers


Your data is truncated, so I'll work with what I can:

txt <- "\n-----------------------------------------------------------------------------\n   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV\n    hPa     m      C      C      %    g/kg    deg   knot     K      K      K \n-----------------------------------------------------------------------------\n 1009.0     16   23.8   22.7     94  17.56    170      2  296.2  346.9  299.3\n 1002.0     78   24.6   21.6     83  16.51    252      4  297.6  345.6  300.5\n 1000.0     96   24.4   21.3     83  16.23    275      4  297.6  344.8  300.4\n  962.0    434   22.9   20.0     84  15.56    235     10  299.4  345.0  302.1\n  925.0    777   21.4   18.7     85  14.90    245     11  301.2  345.2  303.9\n  887.0   1142   20.3   16.0     76  13.04    255     15  303.7  342.7  306.1\n  850.0   1512   19.2   13.2     68  11.34    230     17  306.2  340.6  308.3\n"

It appears to be fixed-width, with lines compacted into a single string using the \n delimiter, so let's split it up:

strsplit(txt, "\n")
# [[1]]
#  [1] ""                                                                             
#  [2] "-----------------------------------------------------------------------------"
#  [3] "   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV"
#  [4] "    hPa     m      C      C      %    g/kg    deg   knot     K      K      K "
#  [5] "-----------------------------------------------------------------------------"
#  [6] " 1009.0     16   23.8   22.7     94  17.56    170      2  296.2  346.9  299.3"
#  [7] " 1002.0     78   24.6   21.6     83  16.51    252      4  297.6  345.6  300.5"
#  [8] " 1000.0     96   24.4   21.3     83  16.23    275      4  297.6  344.8  300.4"
#  [9] "  962.0    434   22.9   20.0     84  15.56    235     10  299.4  345.0  302.1"
# [10] "  925.0    777   21.4   18.7     85  14.90    245     11  301.2  345.2  303.9"
# [11] "  887.0   1142   20.3   16.0     76  13.04    255     15  303.7  342.7  306.1"
# [12] "  850.0   1512   19.2   13.2     68  11.34    230     17  306.2  340.6  308.3"

It seems that row 1 is empty, and rows 2 and 5 are separator lines that need to be removed. Rows 3-4 appear to be the column header and units, respectively; since R doesn't allow multi-row headers, I'll drop the units and leave it to you to save them elsewhere if you need them.

From here, it's a straightforward call (noting the [[1]] to index into strsplit's returned list):

read.table(text=strsplit(txt, "\n")[[1]][-c(1,2,4,5)], header=TRUE)
#   PRES HGHT TEMP DWPT RELH  MIXR DRCT SKNT  THTA  THTE  THTV
# 1 1009   16 23.8 22.7   94 17.56  170    2 296.2 346.9 299.3
# 2 1002   78 24.6 21.6   83 16.51  252    4 297.6 345.6 300.5
# 3 1000   96 24.4 21.3   83 16.23  275    4 297.6 344.8 300.4
# 4  962  434 22.9 20.0   84 15.56  235   10 299.4 345.0 302.1
# 5  925  777 21.4 18.7   85 14.90  245   11 301.2 345.2 303.9
# 6  887 1142 20.3 16.0   76 13.04  255   15 303.7 342.7 306.1
# 7  850 1512 19.2 13.2   68 11.34  230   17 306.2 340.6 308.3
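If you want to keep the units row rather than discard it, one option is to stash it as an attribute on the data frame. A minimal sketch, using a shortened stand-in for `txt` (fewer columns and rows than the real sounding, for brevity):

```r
# a short, hypothetical slice of the sounding text (same shape as above)
txt <- "\n-----\n   PRES   HGHT   TEMP\n    hPa     m      C\n-----\n 1009.0     16   23.8\n 1002.0     78   24.6\n"

lines <- strsplit(txt, "\n")[[1]]
units <- strsplit(trimws(lines[4]), " +")[[1]]   # "hPa" "m" "C"
dat   <- read.table(text = lines[-c(1, 2, 4, 5)], header = TRUE)

# attach the units as a named attribute so they travel with the data frame
attr(dat, "units") <- setNames(units, names(dat))
attr(dat, "units")[["PRES"]]
# [1] "hPa"
```

This keeps the parsed values numeric while the units remain one lookup away.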
r2evans
  • if I assign this `data <- read.table(text=strsplit(txt, "\n")[[1]][-c(1,2,4,5)], header=TRUE)` and then check `class(data)`, the result is `function`. This confuses me a bit; or is it actually closer to a list? – Siti Sal Sep 27 '18 at 11:47
  • That's partly your fault: `data` is a base function, so if your call to `data <- ...` fails, the interpreter still finds an object named `data`. This is one reason I always discourage the use of `data` as a variable name; perhaps `dat`, never `data`. So this tells me there is something else about your data that we don't know. Please make this question *reproducible*, with fully usable sample data (e.g., output from `dput(head(x))`). Refs: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. – r2evans Sep 27 '18 at 18:46

We can extend your base code and treat the web page as an API endpoint since it takes parameters:

library(httr)
library(rvest)

I use more packages than ^^ below via `::`, but I don't want to pollute the namespace.

I'd usually end up writing a small, parameterized function (or a small package with a couple of parameterized functions) to encapsulate the logic below.
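A minimal sketch of what such a function might look like, anticipating the steps below (the name `get_sounding()` and its argument defaults are mine, not anything official; it uses `::` throughout so nothing is attached):

```r
# hypothetical helper: fetch one sounding from the UWyo CGI as a tibble
get_sounding <- function(stnm, year, month, from, to, region = "seasia") {
  res <- httr::GET(
    url = "http://weather.uwyo.edu/cgi-bin/sounding",
    query = list(region = region, TYPE = "TEXT:LIST",
                 YEAR = year, MONTH = month,
                 FROM = from, TO = to, STNM = stnm)
  )
  httr::stop_for_status(res)                     # fail loudly on HTTP errors
  pre <- rvest::html_nodes(httr::content(res, as = "parsed"), "pre")
  lines <- unlist(strsplit(rvest::html_text(pre[[1]]), "\n"))
  col_names <- tolower(unlist(strsplit(trimws(lines[3]), " +")))
  readr::read_table(paste0(lines[-(1:5)], collapse = "\n"),
                    col_names = col_names)
}

# usage (network call, so commented out here):
# wx <- get_sounding("48657", "2006", "09", "0100", "0100")
```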

httr::GET(
  url = "http://weather.uwyo.edu/cgi-bin/sounding",
  query = list(
    region = "seasia",
    TYPE = "TEXT:LIST",
    YEAR = "2006",
    MONTH = "09",
    FROM = "0100",
    TO = "0100",
    STNM = "48657"
  )
) -> res

^^ makes the web page request and gathers the response.

httr::content(res, as="parsed") %>%
  html_nodes("pre") -> wx_dat

^^ parses the response into an html_document and pulls out its <pre> nodes.

Now, we extract the readings:

html_text(wx_dat[[1]]) %>%           # turn the first <pre> node into text
  strsplit("\n") %>%                 # split it into lines
  unlist() %>%                       # turn it back into a character vector
  { col_names <<- .[3]; . } %>%      # pull out the column names (we'll use them later)
  .[-(1:5)] %>%                      # strip off the header
  paste0(collapse="\n") -> readings  # turn it back into a big text blob

^^ cleaned up the table; we'll use readr::read_table() to parse it. We'll also turn the extracted column names into the actual column names:

readr::read_table(readings, col_names = tolower(unlist(strsplit(trimws(col_names), " +"))))
## # A tibble: 106 x 11
##     pres  hght  temp  dwpt  relh  mixr  drct  sknt  thta  thte  thtv
##    <dbl> <int> <dbl> <dbl> <int> <dbl> <int> <int> <dbl> <dbl> <dbl>
##  1  1009    16  23.8  22.7    94 17.6    170     2  296.  347.  299.
##  2  1002    78  24.6  21.6    83 16.5    252     4  298.  346.  300.
##  3  1000    96  24.4  21.3    83 16.2    275     4  298.  345.  300.
##  4   962   434  22.9  20      84 15.6    235    10  299.  345   302.
##  5   925   777  21.4  18.7    85 14.9    245    11  301.  345.  304.
##  6   887  1142  20.3  16      76 13.0    255    15  304.  343.  306.
##  7   850  1512  19.2  13.2    68 11.3    230    17  306.  341.  308.
##  8   839  1624  18.8  11.8    64 10.5    225    17  307   339.  309.
##  9   828  1735  18    11.4    65 10.3    220    17  307.  339.  309.
## 10   789  2142  15.1  10      72  9.84   205    16  308.  339.  310.
## # ... with 96 more rows
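The units row we skipped told us that sknt is in knots, so converting a column to SI afterward is a one-liner; a sketch with a toy stand-in for two of the parsed columns:

```r
# toy stand-in for two columns of the parsed readings tibble above
readings <- data.frame(pres = c(1009, 1002), sknt = c(2, 4))

# wind speed is in knots; 1 kt = 1852/3600 ≈ 0.514444 m/s
readings$wspd_ms <- readings$sknt * 0.514444
readings$wspd_ms
# [1] 1.028888 2.057776
```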

You didn't say you wanted the station metadata, but we can get that too (it's in the second <pre>):

html_text(wx_dat[[2]]) %>%
  strsplit("\n") %>%
  unlist() %>%
  trimws() %>%       # get rid of whitespace
  .[-1] %>%          # blank line removal
  strsplit(": ") %>% # separate field and value
  lapply(function(x) setNames(as.list(x), c("measure", "value"))) %>% # make each pair a named list
  dplyr::bind_rows() -> metadata # turn it into a data frame

metadata
## # A tibble: 30 x 2
##    measure                                 value      
##    <chr>                                   <chr>      
##  1 Station identifier                      WMKD       
##  2 Station number                          48657      
##  3 Observation time                        060901/0000
##  4 Station latitude                        3.78       
##  5 Station longitude                       103.21     
##  6 Station elevation                       16.0       
##  7 Showalter index                         0.34       
##  8 Lifted index                            -1.40      
##  9 LIFT computed using virtual temperature -1.63      
## 10 SWEAT index                             195.39     
## # ... with 20 more rows
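If you'd rather look metadata values up by name (and get numerics as numerics rather than character strings), a small base-R sketch using a toy slice of the metadata above:

```r
# toy slice of the metadata tibble from above
metadata <- data.frame(
  measure = c("Station identifier", "Station latitude", "Station elevation"),
  value   = c("WMKD", "3.78", "16.0"),
  stringsAsFactors = FALSE
)

# named-vector lookup, coercing to numeric where it makes sense
meta <- setNames(metadata$value, metadata$measure)
lat  <- as.numeric(meta[["Station latitude"]])
lat
# [1] 3.78
```

Non-numeric fields like the station identifier stay character; only coerce the ones you know are numbers (as.numeric() on "WMKD" would give NA with a warning).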
hrbrmstr
  • this is another level for me.. what do you mean "pollute the namespace"? – Siti Sal Sep 27 '18 at 14:51
  • `library()` or `require()` load all the exported functions & objects from a package. Since I'm just using one function from `readr` and one from `dplyr` there's no need to do that. If you start R in a clean session and do `library(dplyr)` you'll see that it says a bunch of functions are "masked", which means `dplyr` clobbers the namespace. It's usually not a "bad" thing (e.g. I never used `stats::filter` so `dplyr` clobbering that isn't bad) but when I just need to use a function or two from some pkg these days I tend to do `pkg::function()` vs load the whole package in. – hrbrmstr Sep 27 '18 at 15:42
  • `readr::read_table(readings, col_names = tolower(unlist(strsplit(trimws(col_names), " +"))))` why is there a + sign in the strsplit delimiter? – Siti Sal Sep 27 '18 at 16:45
  • That signifies "one or more spaces". I rly shld have used `"[[:space:]]+"` as I believe the extended POSIX regular expression notation is way clearer. Anyway, the header words have multiple spaces between them, so that splits them cleanly. – hrbrmstr Sep 27 '18 at 17:38
  • if you are interested you may look into this as well.. thank you for your time https://stackoverflow.com/q/52543892/7356308 – Siti Sal Sep 27 '18 at 19:30