This is not elegant and probably not very robust, but it should work for this case.
The first 4 lines after the require() calls retrieve the URL and extract the text. grepl() returns TRUE or FALSE for each element depending on whether the string we are looking for was found, and which() converts that to an index in the list. We add 1 to that index because, if you look at cleantext, the date itself is the element right after the string "Date Updated", so the +1 gets us that element. The gsub() lines just clean up the strings.
The problem with the "P27M" is that it is not anchored to anything: it is just free text floating about in an arbitrary position. If you are sure that the price is always going to be a "P" followed by 1 to 3 digits, followed by an "M", AND that there is only one such string in the page, then a regex would work. Otherwise it will be tough to extract reliably.
require(XML)
require(RCurl)
myurl <- 'http://www.sulit.com.ph/index.php/view+classifieds/id/3991016/BEAUTIFUL+AYALA+HEIGHTS+QC+HOUSE+FOR+SALE'
mytext <- getURL(myurl)
myhtml <- htmlTreeParse(mytext, useInternalNodes = TRUE)
cleantext <- xpathApply(myhtml, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
cleantext <- cleantext[!cleantext %in% " "]
cleantext <- gsub(" "," ", cleantext)
date_updated <- cleantext[[which(grepl("Date Updated",cleantext))+1]]
date_posted <- cleantext[[which(grepl("Date Posted",cleantext))+1]]
date_posted <- gsub("^[[:space:]]+|[[:space:]]+$","",date_posted)
date_updated <- gsub("^[[:space:]]+|[[:space:]]+$","",date_updated)
print(date_updated)
print(date_posted)
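
For the price, here is a rough sketch of the regex idea, under the assumption stated above (a "P", then 1 to 3 digits, then an "M", appearing only once). The sample_text vector and its contents are made up for illustration and simply stand in for cleantext; the pattern is my guess at what "price-like" means, not something the page guarantees.

```r
# Hypothetical sample standing in for the cleantext vector built above
sample_text <- c("BEAUTIFUL AYALA HEIGHTS QC HOUSE FOR SALE",
                 "Selling for P27M, negotiable",
                 "Some other text")

# "P", then 1 to 3 digits, then "M", delimited by word boundaries
price_pattern <- "\\bP[0-9]{1,3}M\\b"

# regexpr() finds the first match in each element;
# regmatches() keeps only the matched substrings
prices <- regmatches(sample_text,
                     regexpr(price_pattern, sample_text, perl = TRUE))

# Only trust the result if exactly one price-like string was found
if (length(prices) == 1) {
  print(prices)
} else {
  warning("Expected exactly one price-like string, found ", length(prices))
}
```

If the page ever contains more than one string of that shape, the length check fails and you are back to needing some kind of anchor in the surrounding text.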