Scrape using Perl regex in R

Question

With either rvest or RSelenium, you can scrape links in R by specifying the beginning of the HTML code, e.g. the a href within a given node. But what if I face the following link:

<a href="www.website.com" data-tracking="click_body" data-tracking- 
data='{"touch_point_button":"photo"}' data-featured-name="listing_no_promo" >

To grab the no-promo links, I would use the following code (from the XML and httr packages):

library(XML)
library(httr)
response <- GET(yourLink)
parsedoc <- htmlParse(response)
# The attribute in the example link is data-featured-name
xpathSApply(parsedoc, "//a[@data-featured-name='listing_no_promo']",
            xmlGetAttr, "href")

What should I do if I want to obtain the links whose attribute value ends with the 'photo' part:

data-tracking- data='{"touch_point_button":"photo"}'

regardless of the promo or no-promo part? My guess is that the curly brackets are causing trouble here.

The value of the `data-tracking-data` attribute is a JSON data structure. Parsing that with regex is going to be tricky if you cannot guarantee that the ones you want to match always look like this. — simbabque, Aug 02 '18 at 07:41

score 0 · Answer 1 · answered Sep 10 '18 at 13:32

I'm assuming your example link structure is actually as follows (where data-tracking-data is the actual attribute):

<a href="www.website.com" data-tracking="click_body" data-tracking-data='{"touch_point_button":"photo"}' data-featured-name="listing_no_promo">link</a>

Since I don't know what site you are working with, I recreated an HTML document by adding your link to the body of this page:

# I'm going to use the jsonlite and xml2 packages,
# plus magrittr for the %>% pipe used below

library(jsonlite)
library(xml2)
library(magrittr)

# This page
stack_url <- "https://stackoverflow.com/questions/40934644/xpath-for-element-whose-attribute-value-ends-with-a-specific-string"

# Your html element example
test_a <- '<a href="www.website.com" data-tracking="click_body" data-tracking-data=\'{"touch_point_button":"photo"}\' data-featured-name="listing_no_promo" >link</a>'

# read in stackoverflow page
raw_page <- read_html(stack_url)
# read in the element a
raw_a <- read_html(test_a)

# add the link element from example to raw_page
xml_add_child(raw_page, raw_a)
# Show that the tag you provided is now mixed in with the page's other link elements, as I assume it would be in your actual use
xml_find_all(raw_page,".//a") %>% tail()

{xml_nodeset (6)}
[1] <a href="https://www.facebook.com/officialstackoverflow/" class="-link">Facebook</a>
[2] <a href="https://twitter.com/stackoverflow" class="-link">Twitter</a>
[3] <a href="https://linkedin.com/company/stack-overflow" class="-link">LinkedIn</a>
[4] <a href="https://creativecommons.org/licenses/by-sa/3.0/" rel="license">cc by-sa 3.0</a>
[5] <a href="https://stackoverflow.blog/2009/06/25/attribution-required/" rel="license">attribution required</a>
[6] <a href="www.website.com" data-tracking="click_body" data-tracking-data='{"touch_point_button":"photo"}' data-f ...

Our xml_document is now stored in raw_page, and we will use the following XPath to find what we want:

.//a[attribute::*[contains(.,'{') or contains(.,'photo')] and @data-tracking]

# Our xpath pattern reads as:
# 
# - .//a[ -> find all 'a' html elements where
# - attribute::*[contains(.,'{') or contains(.,'photo')] -> any(*) attribute containing either a '{' OR the string 'photo'
# - and @data-tracking -> and the element must have the attribute data-tracking, but it doesn't matter what the value is
# - ] -> end

In short:
Find all links that have a data-tracking attribute AND that have any attribute containing the word photo OR the character {.
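
Since every JSON object contains a {, the pattern above also matches tracked links whose button is not photo. A narrower variant could test only the data-tracking-data attribute for the exact key/value pair. This is a minimal sketch, not something tested against the real site: the sample markup and the names html, doc, strict_xpath and photo_links are made up for illustration, and it assumes the site writes the JSON without whitespace around the colon.

```r
library(xml2)

# Hypothetical sample markup, modelled on the link in the question
html <- '<div>
  <a href="www.website.com/a" data-tracking="click_body" data-tracking-data=\'{"touch_point_button":"photo"}\' data-featured-name="listing_no_promo">a</a>
  <a href="www.website.com/b" data-tracking="click_body" data-tracking-data=\'{"touch_point_button":"title"}\' data-featured-name="listing_promo">b</a>
  <a href="www.website.com/c" class="photo-gallery">c</a>
</div>'
doc <- read_html(html)

# Test only the data-tracking-data attribute instead of every attribute
strict_xpath <- ".//a[contains(@data-tracking-data, '\"touch_point_button\":\"photo\"')]"

# Extract the href of each matching link
photo_links <- xml_attr(xml_find_all(doc, strict_xpath), "href")
print(photo_links)
```

With this sample only the first link is selected; the second has a different button value and the third has no data-tracking-data attribute at all.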

our_xpath <- ".//a[attribute::*[contains(.,'{') or contains(.,'photo')] and @data-tracking]"
# Extract all of the matching elements using our xpath
# Get all the attribute values for data-tracking-data
# Parse from JSON
xml_find_all(raw_page,our_xpath) %>% xml_attr("data-tracking-data") %>% fromJSON()

Which results in:

$touch_point_button
[1] "photo"
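
fromJSON() parses one JSON document at a time, so on a real page with several matching links it is probably safer to parse each attribute value separately and then filter on the parsed value. Below is a rough sketch of that idea under the same assumptions as the example link. The sample markup and the names html, doc, links, buttons and photo_hrefs are hypothetical.

```r
library(xml2)
library(jsonlite)

# Hypothetical sample markup; the last link has extra keys and whitespace
html <- '<div>
  <a href="www.website.com/a" data-tracking-data=\'{"touch_point_button":"photo"}\'>a</a>
  <a href="www.website.com/b" data-tracking-data=\'{"touch_point_button":"title"}\'>b</a>
  <a href="www.website.com/c">c</a>
  <a href="www.website.com/d" data-tracking-data=\'{"position": 2, "touch_point_button": "photo"}\'>d</a>
</div>'
doc <- read_html(html)

# Keep only links that carry the JSON attribute
links <- xml_find_all(doc, ".//a[@data-tracking-data]")

# Parse each attribute value on its own and read the button name;
# anything that is not a JSON object with that key becomes NA
buttons <- vapply(
  xml_attr(links, "data-tracking-data"),
  function(txt) {
    parsed <- tryCatch(fromJSON(txt), error = function(e) NULL)
    value <- if (is.list(parsed)) parsed[["touch_point_button"]] else NULL
    if (is.character(value) && length(value) == 1) value else NA_character_
  },
  character(1),
  USE.NAMES = FALSE
)

# href of every link whose parsed button is "photo"
photo_hrefs <- xml_attr(links, "href")[!is.na(buttons) & buttons == "photo"]
print(photo_hrefs)
```

Comparing the parsed value rather than the raw string means key order and whitespace inside the JSON no longer matter, which is why the last sample link is still picked up.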

I have no way to test against your page, but if you post the URL I'd be happy to make sure it works accordingly.

Hi, this is about the Polish property website www.otodom.pl; please find a sample link below: https://www.otodom.pl/sprzedaz/mieszkanie/warszawa/?search%5Bdist%5D=0&search%5Bsubregion_id%5D=197&search%5Bcity_id%5D=26 — M_D, Oct 09 '18 at 14:04

score 0 · Answer 2 · answered Sep 10 '18 at 15:05

//*[ends-with(@data-tracking-data, '"photo"}')]/@href

From your example, this XPath will give you the href attribute if data-tracking-data ends with the string "photo"}
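
One caveat: ends-with() is an XPath 2.0 function, while the R packages used in this thread (XML and xml2) sit on top of libxml2, which implements XPath 1.0. A possible XPath 1.0 equivalent compares the tail of the attribute using substring() and string-length(). The following is only a sketch on made-up markup; the names html, doc, suffix, xpath_1_0 and hrefs are hypothetical, and it reads the href with xml_attr() rather than selecting /@href in the XPath.

```r
library(xml2)

# Hypothetical sample markup, modelled on the link in the question
html <- '<div>
  <a href="www.website.com/a" data-tracking-data=\'{"touch_point_button":"photo"}\'>a</a>
  <a href="www.website.com/b" data-tracking-data=\'{"touch_point_button":"title"}\'>b</a>
</div>'
doc <- read_html(html)

# The ending we want the attribute value to have
suffix <- '"photo"}'

# XPath 1.0 stand-in for ends-with(): take the last nchar(suffix)
# characters of the attribute and compare them with the suffix
xpath_1_0 <- sprintf(
  ".//a[substring(@data-tracking-data, string-length(@data-tracking-data) - %d) = '%s']",
  nchar(suffix) - 1, suffix
)

hrefs <- xml_attr(xml_find_all(doc, xpath_1_0), "href")
print(hrefs)
```

Like the original expression, this only works while "photo" is the last value in the JSON; if the site appends further keys, the attribute no longer ends with that suffix.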

Thiyaga
