As a practice project, I am trying to scrape property data from a website. (I only intend to practice my web scraping skills and have no intention of making further use of the scraped data.) However, some properties have no price available, so the scraped vectors have different lengths and I get an error when I try to combine them into one data frame.

Here is the code for scraping:

library(tidyverse)
library(rvest)

web_page <- read_html("https://wx.fang.anjuke.com/loupan/all/a1_p2/")

community_name <- web_page %>% 
  html_nodes(".items-name") %>% 
  html_text()

length(community_name)

listed_price <- web_page %>% 
  html_nodes(".price") %>% 
  html_text()

length(listed_price)

property_data <- data.frame(
  name=community_name,
  price=listed_price
)

How can I identify the properties with no listed price and fill the price variable with NA when no value is scraped?

halfer
Felix Zhao
  • Does this answer your question? [How do you scrape items together so you don't lose the index?](https://stackoverflow.com/questions/56673908/how-do-you-scrape-items-together-so-you-dont-lose-the-index) or this one: https://stackoverflow.com/questions/62479704/webscraping-html-tables-with-variable-length-how-do-i-make-sure-my-data-ends-u/62480248#62480248 – Dave2e Oct 27 '20 at 00:51

1 Answer

Inspection of the web page shows that the class is .price when price has a value, and .price-txt when it does not. So one solution is to use an XPath expression in html_nodes() and match classes that start with "price":

listed_price <- web_page %>% 
  html_nodes(xpath = "//p[starts-with(@class, 'price')]") %>% 
  html_text()

length(listed_price)
[1] 60
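
If you want an explicit NA for properties without a price, an alternative is to select each listing's container first and then extract the name and price inside it, as the questions linked in the comment above suggest. The following is only a minimal sketch on made-up HTML: the wrapper class "item-mod" and the sample markup are assumptions, not taken from the real page, so check the actual container class in the browser inspector first. It relies on html_node() (singular) returning one result per input node, with NA where nothing matches.

```r
library(rvest)

# Hypothetical markup standing in for the listing page.
# "item-mod" is an assumed wrapper class; the real page may use another one.
html <- '<div class="item-mod">
  <span class="items-name">Community A</span>
  <p class="price">12000</p>
</div>
<div class="item-mod">
  <span class="items-name">Community B</span>
  <p class="price-txt">Price pending</p>
</div>'

page <- read_html(html)

# One node per listing, so name and price stay aligned
items <- html_nodes(page, ".item-mod")

# html_node() keeps the length of `items`; missing matches become NA
property_data <- data.frame(
  name  = items %>% html_node(".items-name") %>% html_text(),
  price = items %>% html_node(".price") %>% html_text(),
  stringsAsFactors = FALSE
)

print(property_data)
```

With this approach the second row keeps its name and gets NA for price, so both columns always have the same length.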
neilfws
  • Thank you very much for your help; it worked very well. I need to learn more about XPath. Do you happen to have any good materials on XPath? Thank you neilfws! – Felix Zhao Oct 27 '20 at 05:08
  • I use CSS selectors more than XPath, but it is useful in this case. I don't have any particular recommendations other than Google search :) perhaps [this is a good start point](https://www.w3schools.com/xml/xpath_syntax.asp) – neilfws Oct 27 '20 at 05:16