
I am trying to collect URLs from a webpage with RSelenium, but I am getting an InvalidSelector error

I use R 3.6.0 on a Windows 10 PC, with RSelenium 1.7.5 and the Chrome webdriver (chromever = "75.0.3770.8").


library(RSelenium)

rD <- rsDriver(browser=c("chrome"), chromever="75.0.3770.8")
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()

url <- "https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=1&gen=0&ps=96"
remDr$navigate(url)

tt <- remDr$findElements(using = "xpath", "//a[contains(@href,'http://twitter.com/')]/@href")

I expect to collect the URLs of the Twitter accounts of the politicians listed. Instead I am getting the following error:

Selenium message:

invalid selector: The result of the xpath expression "//a[contains(@href,'http://twitter.com/')]/@href" is: [object Attr]. It should be an element.
  (Session info: chrome=75.0.3770.80)
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/invalid_selector_exception.html
Build info: version: '4.0.0-alpha-1', revision: 'd1d3728cae', time: '2019-04-24T16:15:24'
System info: host: 'ALEX-DELL-17', ip: '10.0.75.1', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_191'
Driver info: driver.version: unknown

Error: Summary: InvalidSelector Detail: Argument was an invalid selector (e.g. XPath/CSS). class: org.openqa.selenium.InvalidSelectorException Further Details: run errorDetails method

When I make a similar search for a very specific element, everything works fine. For example:

tt <- remDr$findElement(value = '//a[@href = "http://twitter.com/AlboMP"]')

then

tt$getElementAttribute('href') 

returns the URL I need.

What am I doing wrong?

  • When I checked the site, there is no element with such an xpath. Can you try it with a CSS selector or link text? – Prasanth Ganesan Jun 13 '19 at 08:44
  • The code should look for all links to Twitter; it is supposed to find all URLs that include http://twitter.com. The bottom code, with its strict search condition (a concrete Twitter URL), works fine. Maybe the issue is with that soft xpath syntax, but I don't see what is wrong there. – Alex Jun 13 '19 at 09:34

3 Answers


This error message...

invalid selector: The result of the xpath expression "//a[contains(@href,'http://twitter.com/')]/@href" is: [object Attr]. It should be an element.

...implies that your XPath expression was not a valid one.

The expression:

//a[contains(@href,'http://twitter.com/')]/@href

doesn't return an element; it returns an [object Attr]. While this was acceptable with Selenium RC, the methods of WebDriver's WebElement interface require an element object, not just any DOM node object.

To sum it up, Selenium still doesn't support this format. To fix the issue, you need to select the element itself and then read the attribute from it.


Solution

To fix this issue you need to use findElements, which creates a list:

remDr$findElements(value = '//a[@href = "http://twitter.com/AlboMP"]')

Now you can iterate over the list and extract the URLs using the getElementAttribute('href') method.
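A minimal sketch of that iteration in R, assuming remDr is the open session from the question, and using the contains() pattern from the question so that all matching links are returned:

# find all anchors whose href contains the twitter.com domain
links <- remDr$findElements(using = "xpath",
                            value = "//a[contains(@href, 'twitter.com/')]")

# getElementAttribute() returns a list, so unlist the results
# into a plain character vector of URLs
urls <- unlist(sapply(links, function(el) el$getElementAttribute("href")))
print(urls)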


Reference

InvalidSelectorError: The result of the xpath expression is: [object Text]

undetected Selenium
  • Hi Debanjan, thanks for your answer. I know that the code ` findElements(value = '//a[@href = "http://twitter.com/AlboMP"]') ` works. However, it doesn't solve my task: I need to extract URLs that I don't know exactly but that match a certain pattern, namely that they contain 'twitter.com'. Hence I use the expression with 'contains'. The code above returns just one URL. – Alex Jun 13 '19 at 09:59
  • @Alex The error which you were seeing, `invalid selector`, and the task to _extract URLs_ are two completely different aspects. In this answer I have provided you the canonical answer on how to avoid the `invalid selector` error. – undetected Selenium Jun 13 '19 at 10:23
  • Your answer effectively duplicates the final part of my question; I already knew that the code findElements(value = '//a[@href = "http://twitter.com/AlboMP"]') worked. – Alex Jun 13 '19 at 10:42

I don't know anything about R, so I am posting an answer with Python. As this post is about R, I learned some R basics and am posting that too.

The easiest way to get the Twitter URLs is to iterate through all the URLs on the webpage and check whether each one contains the word 'twitter'.

In Python (which works absolutely fine):

driver.get('https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=1&gen=0&ps=96')
# collect every anchor element that has an href attribute
links = driver.find_elements_by_xpath("//a[@href]")
for link in links:
    if 'twitter' in link.get_attribute("href"):
        print(link.get_attribute("href"))

Result:

http://twitter.com/AlboMP
http://twitter.com/SharonBirdMP
http://twitter.com/Bowenchris
http://twitter.com/tony_burke
http://twitter.com/lindaburneymp
http://twitter.com/Mark_Butler_MP
https://twitter.com/terrimbutler
http://twitter.com/AnthonyByrne_MP
https://twitter.com/JEChalmers
http://twitter.com/NickChampionMP
https://twitter.com/LMChesters
http://twitter.com/JasonClareMP
https://twitter.com/SharonClaydon
https://www.twitter.com/LibbyCokerMP
https://twitter.com/JulieCollinsMP
http://twitter.com/fitzhunter
http://twitter.com/stevegeorganas
https://twitter.com/andrewjgiles
https://twitter.com/lukejgosling
https://www.twitter.com/JulianHillMP
http://twitter.com/stephenjonesalp
https://twitter.com/gedkearney
https://twitter.com/MikeKellyofEM
http://twitter.com/mattkeogh
http://twitter.com/PeterKhalilMP
http://twitter.com/CatherineKingMP
https://twitter.com/MadeleineMHKing
https://twitter.com/ALEIGHMP
https://twitter.com/RichardMarlesMP
https://twitter.com/brianmitchellmp
http://twitter.com/#!/RobMitchellMP
http://twitter.com/ShayneNeumannMP
https://twitter.com/ClareONeilMP
http://twitter.com/JulieOwensMP
http://www.twitter.com/GrahamPerrettMP
http://twitter.com/tanya_plibersek
http://twitter.com/AmandaRishworth
http://twitter.com/MRowlandMP
https://twitter.com/JoanneRyanLalor
http://twitter.com/billshortenmp
http://www.twitter.com/annewerriwa
http://www.twitter.com/stemplemanmp
https://twitter.com/MThistlethwaite
http://twitter.com/MariaVamvakinou
https://twitter.com/TimWattsMP
https://twitter.com/joshwilsonmp

In R (this may be wrong, but you can get the idea):

library(XML)
library(RCurl)

url <- "https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=1&gen=0&ps=96"
doc <- getURL(url)        # fetch the raw HTML
parser <- htmlParse(doc)  # parse it into a DOM tree
# collect the href attribute of every anchor on the page
links <- xpathSApply(parser, "//a[@href]", xmlGetAttr, "href")
for (link in links) {
    if (grepl("twitter", link)) {
        print(link)
    }
}

I don't even know if this code will work, but the idea is to get all the URLs on a page, iterate over them, and check whether the word twitter is in each one. My answer is based on this.
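A minimal sketch of the same idea using RSelenium itself, assuming remDr is the open session from the question, already navigated to the URL:

# find every anchor that has an href attribute
links <- remDr$findElements(using = "xpath", value = "//a[@href]")

for (link in links) {
    # getElementAttribute() returns a list; take its first element
    href <- link$getElementAttribute("href")[[1]]
    if (grepl("twitter", href)) {
        print(href)
    }
}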

Prasanth Ganesan
  • Thanks @Prasanth! That approach, with `xpathSApply(parser, "//a[@href]", xmlGetAttr, "href")`, works to find all URLs on the page, which can then be filtered down to the subset matching the criteria. – Alex Jun 14 '19 at 00:06

Well, maybe a little late, but your solution may be to take the vector of links in this way:

links <- remDr$findElements(value = "//*[contains(@href, 'twitter.com/')]")
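From there, the hrefs can be pulled out into a character vector, in a sketch along the same lines as above (again assuming remDr is the open session from the question):

# extract the href attribute from each matched element
urls <- unlist(lapply(links, function(el) el$getElementAttribute("href")))
urls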

lasagna