
I'm trying to read the table on this site:

http://spacefem.com/pregnant/due.php?use=EDD&m=09&d=10&y=16

I use rvest, but quickly get an error:

library(rvest)
read_html("http://spacefem.com/pregnant/due.php?use=EDD&m=09&d=10&y=16")

Error: Name spoiler:3tbt4d3m is not XML Namespace compliant [202]

What does this error mean, and is there anything I can do to get around it?

I've gotten as far as pinpointing the internal function causing the error: xml2:::doc_parse_raw. However, xml2:::doc_parse_raw is simply a call to internal C code, making debugging of this issue substantially more difficult.

MichaelChirico
  • It looks like there's a malformed tag in there that's causing problems, specifically `<spoiler:3tbt4d3m>`. It doesn't contain the table, so you could probably regex it out if you grab the HTML with `httr::GET` or similar, but there may be a better way. – alistaire Sep 01 '16 at 23:50

2 Answers


The HTML contains a malformed tag, <spoiler:3tbt4d3m>, which is what the error is complaining about. If you fetch the HTML with httr without parsing it, you can remove that tag and its contents with a regex; a quick look at the page shows that the tag doesn't contain the table.

library(httr)
library(rvest)

url <- 'http://spacefem.com/pregnant/due.php?use=EDD&m=09&d=10&y=16'

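# fetch the raw page as text, without parsing it yet
html <- url %>% GET(user_agent('R')) %>% content('text')

# drop the malformed tag and everything inside it
html2 <- gsub('<spoiler:3tbt4d3m>.*</spoiler:3tbt4d3m>', '', html)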

df <- html2 %>% read_html() %>% 
    html_node(xpath = '//table[@border="1"]') %>% 
    # obviously insufficient to parse double headers, but at least the data exists now
    html_table(fill = TRUE)

df[1:5, 1:3]
##                        Date Progress Overall probability ofspontaneous labor
## 1                      Date Progress                            On this date
## 2 Saturday August 6th, 2016  35W, 0D                                   0.01%
## 3   Sunday August 7th, 2016  35W, 1D                                   0.01%
## 4   Monday August 8th, 2016  35W, 2D                                   0.02%
## 5  Tuesday August 9th, 2016  35W, 3D                                   0.02%
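If the double header gets in the way, one possible fix (a rough sketch, not part of the original approach) is to fold the first data row into the column names. `merge_headers` and `df_demo` are made-up names, and `df_demo` is a hand-typed stand-in for the first rows of `df`, so this runs without touching the site:

```r
# Hypothetical helper: combine the column names with the first data row,
# which holds the second header line, then drop that row.
merge_headers <- function(df) {
  top <- names(df)
  sub <- unlist(df[1, ], use.names = FALSE)
  # keep a single label where both header rows agree, otherwise join them
  new_names <- ifelse(top == sub, top, paste(top, sub, sep = ": "))
  out <- df[-1, , drop = FALSE]
  names(out) <- new_names
  rownames(out) <- NULL
  out
}

# Hand-typed stand-in for the top of the scraped table (assumption)
df_demo <- data.frame(
  a = c("Date", "Saturday August 6th, 2016"),
  b = c("Progress", "35W, 0D"),
  c = c("On this date", "0.01%"),
  stringsAsFactors = FALSE
)
names(df_demo) <- c("Date", "Progress", "Overall probability ofspontaneous labor")

clean_demo <- merge_headers(df_demo)
names(clean_demo)
```

On the real `df` the same call would be `merge_headers(df)`, assuming every column came back as character, which is what you'd expect when a header row is mixed in with the data.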

Mixing regex and HTML makes me a bit uneasy, so there may be a cleaner way to tidy the document before parsing, but I'm not sure what it would be.
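If you do stick with regex, it may be worth making the pattern a little less fragile. The sketch below assumes the ID after `spoiler:` is alphanumeric and could change between page loads (I haven't verified that); `strip_spoilers` and `sample_html` are made-up names, and the sample string is a toy stand-in for the real page:

```r
# Hypothetical helper: remove any <spoiler:ID>...</spoiler:ID> block.
#   (?s)  lets . match newlines under perl = TRUE
#   .*?   is non-greedy, so two separate spoiler blocks aren't merged into one match
#   \\1   requires the closing tag to carry the same ID as the opening tag
strip_spoilers <- function(html) {
  gsub("(?s)<spoiler:([[:alnum:]]+)>.*?</spoiler:\\1>", "", html, perl = TRUE)
}

# Toy input, not the real page
sample_html <- paste0(
  "<p>before</p>",
  "<spoiler:3tbt4d3m>hidden\ntext</spoiler:3tbt4d3m>",
  "<table><tr><td>kept</td></tr></table>"
)

strip_spoilers(sample_html)
```

With that in place, `html2 <- strip_spoilers(html)` would be a drop-in replacement for the `gsub` call, without hard-coding `3tbt4d3m`.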

alistaire
  • Thanks! Certainly not a very helpful error message for identifying the problem. Had no idea to even search for spoiler in the XML code... – MichaelChirico Sep 02 '16 at 00:13
  • Yeah, it only really makes sense in retrospect, and would be much improved by mentioning it's a tag. I only found it because I thought the `3tbt4d3m` part was odd, so I searched the HTML for it. – alistaire Sep 02 '16 at 00:17
  • With [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) in mind, I'll hold off on accepting a regex-based answer for now, though it certainly works. – MichaelChirico Sep 02 '16 at 00:24
  • Ha, that post is ridiculous. And understood; I'd avoid it if I knew of an alternative. – alistaire Sep 02 '16 at 00:30

Another option is to use htmltidy to "clean" the document. This needs v0.3.0 or higher, which (as of the date of this answer) means using the development version rather than the CRAN version, until CRAN is up to 0.3.0+:

library(rvest)
library(htmltidy) # devtools::install_github("hrbrmstr/htmltidy")
library(httr)

URL <- "http://spacefem.com/pregnant/due.php?use=EDD&m=09&d=10&y=16"

# the site was not returning content for me w/o a more browser-like user agent

res <- GET(URL, user_agent("Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36"))

cleaned <- tidy_html(content(res, as="text", encoding="UTF-8"),
                     list(TidyDocType="html5"))

pg <- read_html(cleaned)
hrbrmstr