
When you scrape links in R with either rvest or RSelenium, you can select them by the beginning of the HTML code, e.g. the href of an a element within a given node. What if I face the following link:

<a href="www.website.com" data-tracking="click_body" data-tracking-data='{"touch_point_button":"photo"}' data-featured-name="listing_no_promo" >

If I wanted to grab the no-promo links, I would use the following piece of code (from the XML and httr packages):

library(XML)
library(httr)
response <- GET(yourLink)
parsedoc <- htmlParse(response)
# The attribute in the example link is data-featured-name
xpathSApply(parsedoc, "//a[@data-featured-name='listing_no_promo']", 
xmlGetAttr, "href")

What should I do if I want to obtain the links whose attribute ends with the 'photo' part:

data-tracking-data='{"touch_point_button":"photo"}'

regardless of the promo or no-promo part? My guess is that the curly brackets are causing trouble here.

simbabque
M_D
  • The value of the `data-tracking-data` attribute is a JSON data structure. Parsing that with regex is going to be tricky if you cannot guarantee that the ones you want to match always look like this. – simbabque Aug 02 '18 at 07:41

2 Answers


I'm assuming your example link structure is actually as follows (where data-tracking-data is the actual attribute):

<a href="www.website.com" data-tracking="click_body" data-tracking-data='{"touch_point_button":"photo"}' data-featured-name="listing_no_promo">link</a>

Since I don't know what site you are working with, I recreated an HTML document by adding your link to the body of this page:

# I'm going to use the jsonlite and xml2 packages,
# plus magrittr for the %>% pipe (xml2 does not export it)

library(jsonlite)
library(xml2)
library(magrittr)

# This page
stack_url <- "https://stackoverflow.com/questions/40934644/xpath-for-element-whose-attribute-value-ends-with-a-specific-string"

# Your html element example
test_a <- '<a href="www.website.com" data-tracking="click_body" data-tracking-data=\'{"touch_point_button":"photo"}\' data-featured-name="listing_no_promo" >link</a>'

# read in stackoverflow page
raw_page <- read_html(stack_url)
# read in the element a
raw_a <- read_html(test_a)

# add the link element from example to raw_page
xml_add_child(raw_page, raw_a)
# This just shows that the tag you provided is mixed in with multiple other link elements, which I assume would be the case in your actual use
xml_find_all(raw_page,".//a") %>% tail()

{xml_nodeset (6)}
[1] <a href="https://www.facebook.com/officialstackoverflow/" class="-link">Facebook</a>
[2] <a href="https://twitter.com/stackoverflow" class="-link">Twitter</a>
[3] <a href="https://linkedin.com/company/stack-overflow" class="-link">LinkedIn</a>
[4] <a href="https://creativecommons.org/licenses/by-sa/3.0/" rel="license">cc by-sa 3.0</a>
[5] <a href="https://stackoverflow.blog/2009/06/25/attribution-required/" rel="license">attribution required</a>
[6] <a href="www.website.com" data-tracking="click_body" data-tracking-data='{"touch_point_button":"photo"}' data-f ...

Our xml_document is now stored in raw_page, and we can use the following XPath to find what we want:

.//a[attribute::*[contains(.,'{') or contains(.,'photo')] and @data-tracking]

# Our xpath pattern reads as:
# 
# - .//a[ -> find all 'a' html elements where
# - attribute::*[contains(.,'{') or contains(.,'photo')] -> any(*) attribute containing either a '{' OR the string 'photo'
# - and @data-tracking -> and the element must have the attribute data-tracking, but it doesn't matter what the value is
# - ] -> end

In short:
Find all links that have a data-tracking attribute AND that have any attribute containing the word photo OR the character {.
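Because of the OR, the pattern above also matches links whose JSON value is something other than photo, since any attribute containing { qualifies. A narrower variant is sketched below, assuming the attribute is named data-tracking-data and that the JSON is written without whitespace around the colon. The two-link test document and the names `broad` and `narrow` are made up for illustration.

```r
library(xml2)

# Hypothetical test document: one "photo" link and one "title" link
doc <- read_html('<html><body>
<a href="www.website.com" data-tracking="click_body" data-tracking-data=\'{"touch_point_button":"photo"}\'>photo link</a>
<a href="www.other.com" data-tracking="click_body" data-tracking-data=\'{"touch_point_button":"title"}\'>title link</a>
</body></html>')

# The pattern from this answer: any attribute containing '{' OR 'photo'
broad <- ".//a[attribute::*[contains(.,'{') or contains(.,'photo')] and @data-tracking]"

# Narrower: only the data-tracking-data attribute, and only the photo key/value pair
narrow <- ".//a[contains(@data-tracking-data, '\"touch_point_button\":\"photo\"')]"

broad_hits <- xml_find_all(doc, broad)
narrow_hrefs <- xml_attr(xml_find_all(doc, narrow), "href")

length(broad_hits)  # number of links matched by the broad pattern
narrow_hrefs        # href values matched by the narrow pattern
```

The narrow pattern is a plain substring test, so it would miss a value written as "touch_point_button": "photo" with a space after the colon.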

our_xpath <- ".//a[attribute::*[contains(.,'{') or contains(.,'photo')] and @data-tracking]"
# Extract all of the matching elements using our xpath
# Get all the attribute values for data-tracking-data
# Parse from JSON
xml_find_all(raw_page,our_xpath) %>% xml_attr("data-tracking-data") %>% fromJSON()

Which results in:

$touch_point_button
[1] "photo"
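Since the attribute value is JSON, another option is to select every link that carries the attribute, parse each value, and filter in R rather than in XPath. The following is only a sketch under assumptions: the three-link test document, the `position` key, and the helper name `is_photo_link` are hypothetical, and the real page may look different.

```r
library(xml2)
library(jsonlite)

# Hypothetical test document; the third link has an extra key after "photo"
doc <- read_html('<html><body>
<a href="www.website.com" data-tracking-data=\'{"touch_point_button":"photo"}\'>one</a>
<a href="www.other.com" data-tracking-data=\'{"touch_point_button":"title"}\'>two</a>
<a href="www.third.com" data-tracking-data=\'{"touch_point_button":"photo","position":3}\'>three</a>
</body></html>')

# All links that have the attribute at all
links <- xml_find_all(doc, "//a[@data-tracking-data]")

# TRUE when the JSON parses to an object whose touch_point_button is "photo"
is_photo_link <- function(txt) {
  parsed <- tryCatch(fromJSON(txt), error = function(e) NULL)
  is.list(parsed) && identical(parsed[["touch_point_button"]], "photo")
}

keep <- vapply(xml_attr(links, "data-tracking-data"), is_photo_link,
               logical(1), USE.NAMES = FALSE)
photo_hrefs <- xml_attr(links[keep], "href")
photo_hrefs
```

This does not depend on key order or whitespace inside the JSON, at the cost of parsing every candidate attribute.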

I have no way to test against your page, but if you post the URL I'd be happy to make sure it works accordingly.

Carl Boneri
  • Hi, this is about polish property website www.otodom.pl, please find below sample link: https://www.otodom.pl/sprzedaz/mieszkanie/warszawa/?search%5Bdist%5D=0&search%5Bsubregion_id%5D=197&search%5Bcity_id%5D=26 – M_D Oct 09 '18 at 14:04
//*[ends-with(@data-tracking-data, '"photo"}')]/@href

From your example, this XPath will give you the href attribute if data-tracking-data ends with the string "photo"}
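Note that ends-with() is an XPath 2.0 function, while the XML and xml2 packages are built on libxml2, which implements XPath 1.0. A common XPath 1.0 workaround compares the tail of the string using substring() and string-length(). The sketch below assumes the attribute is named data-tracking-data; the two-link test document is made up for illustration.

```r
library(xml2)

# Hypothetical test document
doc <- read_html('<html><body>
<a href="www.website.com" data-tracking="click_body" data-tracking-data=\'{"touch_point_button":"photo"}\' data-featured-name="listing_no_promo">link</a>
<a href="www.other.com" data-tracking-data=\'{"touch_point_button":"title"}\'>other</a>
</body></html>')

# XPath 1.0 emulation of ends-with(s, t):
#   substring(s, string-length(s) - string-length(t) + 1) = t
xpath <- paste0(
  "//a[substring(@data-tracking-data, ",
  "string-length(@data-tracking-data) - string-length('\"photo\"}') + 1) ",
  "= '\"photo\"}']"
)

hrefs <- xml_attr(xml_find_all(doc, xpath), "href")
hrefs
```

Links without the attribute are skipped, because an empty string never equals the suffix.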

Thiyaga