
I'm trying to get the values of 'Date Posted' and 'Date Updated' as pictured here. The website URL is: http://sulit.com.ph/3991016

I have a feeling I should be using xpathSApply, as suggested in the thread Web Scraping (in R?), but I just can't get it to work.

library(XML)

url = "http://sulit.com.ph/3991016"
doc = htmlTreeParse(url, useInternalNodes = T)

date_posted = xpathSApply(doc, "??????????", xmlValue)

Also, does anyone know a quick way to get the phrase 'P27M' that is also listed on the website? Help would be appreciated.

– Paolo

2 Answers


Here's another way to do it.

> require(XML)
> 
> url = "http://www.sulit.com.ph/index.php/view+classifieds/id/3991016/BEAUTIFUL+AYALA+HEIGHTS+QC+HOUSE+FOR+SALE"
> doc = htmlParse(url)
> 
> dates = getNodeSet(doc, "//span[contains(string(.), 'Date Posted') or contains(string(.), 'Date Updated')]")
> dates = lapply(dates, function(x){
+         temp = xmlValue(xmlParent(x)["span"][[2]])
+         strptime(gsub("^[[:space:]]+|[[:space:]]+$", "", temp), format = "%B %d, %Y")
+ })
> dates
[[1]]
[1] "2012-07-05"

[[2]]
[1] "2011-08-11"

There's no need to use RCurl, as htmlParse will fetch and parse URLs directly. getNodeSet returns a list of the nodes whose values contain "Date Posted" or "Date Updated". The lapply loops over both of those nodes: for each one it finds the parent node, then takes the value of the parent's second "span" child. This part may not be very robust if the website changes its formatting on different pages (which, after looking at the HTML for that site, seems very possible). SlowLearner's gsub cleans up both dates. I added strptime to return the dates as a date class, but that step is optional and depends on how you plan to use the info later. HTH

– sayhey69
  • I'm sure there's a way to do the xmlParent part in the XPath itself (ancestor:: or parent:: ??) in getNodeSet, which would remove a line of code and make using xpathApply a good choice, but I don't know how to do it. Maybe someone else knows? – sayhey69 Jul 14 '12 at 18:24
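
One untested sketch of that idea: if the date always sits in the very next span sibling of the matched label (an assumption based on this page's markup, not something guaranteed), the parent hop can be folded into the XPath with the following-sibling axis, which also makes the question's xpathSApply usable directly:

require(XML)
url = "http://www.sulit.com.ph/index.php/view+classifieds/id/3991016/BEAUTIFUL+AYALA+HEIGHTS+QC+HOUSE+FOR+SALE"
doc = htmlParse(url)
# take the span immediately following each label span; assumes the date
# lives in that next sibling, as it does on this particular page
raw = xpathSApply(doc, "//span[contains(string(.), 'Date Posted') or contains(string(.), 'Date Updated')]/following-sibling::span[1]", xmlValue)
strptime(gsub("^[[:space:]]+|[[:space:]]+$", "", raw), format = "%B %d, %Y")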

This is not elegant and probably not very robust, but it should work for this case.

The first 4 lines after the require calls retrieve the URL and extract the text. grepl returns TRUE or FALSE for each element depending on whether the string we are looking for has been found, and which converts that to an index into the list. We increment the index by 1 because, if you look at cleantext, you will see that the date updated is the next element in the list after the string "Date Updated". So the +1 gets us the element after "Date Updated". The gsub lines just clean up the strings.
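
For example, with toy values (illustrative only, not taken from the page), the indexing works like this:

x <- c("Price:", "Date Updated", " August 11, 2011 ")
grepl("Date Updated", x)                  # FALSE TRUE FALSE
which(grepl("Date Updated", x))           # 2
x[[which(grepl("Date Updated", x)) + 1]]  # " August 11, 2011 "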

The problem with the "P27M" is that it is not anchored to anything: it is just free text floating about in an arbitrary position. If you are sure that the price is always going to be a "P" followed by 1 to 3 digits, followed by an "M", AND that you only have one such string in the page, then a grep or regex would work (see the sketch after the code below); otherwise it's tough to get.

require(XML)
require(RCurl)

# fetch the page and parse it
myurl <- 'http://www.sulit.com.ph/index.php/view+classifieds/id/3991016/BEAUTIFUL+AYALA+HEIGHTS+QC+HOUSE+FOR+SALE'
mytext <- getURL(myurl)
myhtml <- htmlTreeParse(mytext, useInternal = TRUE)

# pull all visible text nodes, skipping script/style/noscript content
cleantext <- xpathApply(myhtml, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)

# drop whitespace-only elements and collapse double spaces
cleantext <- cleantext[!cleantext %in% " "]
cleantext <- gsub("  "," ", cleantext)

# the date is the element right after its label, hence the +1
date_updated <- cleantext[[which(grepl("Date Updated",cleantext))+1]]
date_posted <- cleantext[[which(grepl("Date Posted",cleantext))+1]]

# strip leading/trailing whitespace
date_posted <- gsub("^[[:space:]]+|[[:space:]]+$","",date_posted)
date_updated <- gsub("^[[:space:]]+|[[:space:]]+$","",date_updated)

print(date_updated)
print(date_posted)
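
As for "P27M", here is a rough sketch under the assumptions above (the price is a single "P" followed by 1 to 3 digits and an "M", occurring only once in the page); it reuses mytext from the code above:

# pull the first substring matching P<1-3 digits>M out of the raw page text
price <- regmatches(mytext, regexpr("P[0-9]{1,3}M", mytext))
print(price)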
– SlowLearner