
I am trying to scrape some data from Yahoo Finance. Usually I have no problem doing this; today, however, I have run into a problem trying to pull a certain container. What might be the reason this is giving me such a difficult time?

I have tried many combinations of XPaths. SelectorGadget for some reason cannot pick up the XPath. I have posted some attempts and the URL below.

The green area is what I am trying to bring into my console.

[Screenshot: the Yahoo Finance holdings page with the target area highlighted in green]

library(tidyverse)
library(rvest)
library(httr)

# XPath copied from the browser's dev tools
read_html("https://ca.finance.yahoo.com/quote/SPY/holdings?p=SPY") %>% 
  html_nodes(xpath = '//*[@id="Col1-0-Holdings-Proxy"]/section/div[1]/div[1]')

{xml_nodeset (0)}

# When I search for all tables using the following:
read_html("https://finance.yahoo.com/quote/xlk/holdings?p=xlk") %>% 
  html_nodes("table") %>% .[1] %>% html_table(fill = TRUE)

I get the table at the bottom of the page. Trying different numbers in the [] leads to errors.

What am I doing wrong? This seems like such an easy scrape. Thanks a bunch for your help.


1 Answer


Your data doesn't reside within an actual HTML table.
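
To see why a table-based approach comes up empty, here is a minimal sketch with toy markup (nothing to do with Yahoo's real classes): when the data lives in divs and spans, html_table() has nothing to parse, while a CSS selector still matches.

library(rvest)

# Toy markup only - the data sits in divs/spans, with no <table> element
doc <- read_html('<div class="row"><span>Label A</span><span>Value A</span></div>
                  <div class="row"><span>Label B</span><span>Value B</span></div>')

doc %>% html_nodes('table') %>% html_table()    # list() - no tables to parse
doc %>% html_nodes('.row span') %>% html_text() # "Label A" "Value A" "Label B" "Value B"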

You could use the following CSS selectors for now, though much of the page looks dynamically generated, and I suspect the attributes and classes will change in future. I have tried to keep the selectors a little more generic to compensate, but you should aim to make them even more generic if possible.

I use CSS selectors throughout for the flexibility and specificity they offer. The [] denotes an attribute selector, the . denotes a class selector, and *= is the contains operator, specifying that the attribute on the left-hand side has a value containing the string on the right-hand side. For example, [class*=screenerBorderGray] matches when the class attribute contains the string screenerBorderGray.
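
A quick sketch of those two selector forms on toy markup (the class names below are illustrative only):

library(rvest)

# Toy nodes with Yahoo-style atomic class names (illustrative only)
doc <- read_html('<div class="Fl(start)"><div class="Bdbc($screenerBorderGray)">match me</div></div>')

# [class*=...]: matches because the class attribute contains screenerBorderGray
doc %>% html_nodes('[class*=screenerBorderGray]') %>% html_text()

# .class: exact class token; the ( and ) must be escaped, written as \\( and \\) in R
doc %>% html_nodes('.Fl\\(start\\)')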

The " " ,">" , "+" between selectors are called combinators and are used to specify relationships between nodes matched by consecutive parts of the selector sequence.

I generate a left-column list of nodes and a right-column list of nodes (ignoring the chart column in between), then join them into a final dataframe.


R

library(rvest)
library(magrittr)

pg <- read_html('https://finance.yahoo.com/quote/xlk/holdings?p=xlk&guccounter=1')

# Left column: the label in the first span of each bordered row
lhs <- pg %>% 
  html_nodes('[id*=Holdings] section > .Fl\\(start\\) [class*=screenerBorderGray] > span:nth-child(1)') %>% 
  html_text()

# Right column: the value in the last span of each bordered row
rhs <- pg %>% 
  html_nodes('[id*=Holdings] section > .Fl\\(start\\) [class*=screenerBorderGray] span + span:last-child') %>% 
  html_text()

df <- data.frame(lhs, rhs) %>% set_names(., c('Title', 'Value'))
df <- df[-c(3), ]      # drop the unwanted chart row
rownames(df) <- NULL   # re-number the rows
print(df)

[Screenshot: the resulting dataframe printed to the console]


Py

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

r = requests.get('https://finance.yahoo.com/quote/xlk/holdings?p=xlk&guccounter=1')
soup = bs(r.content, 'lxml')

# Raw strings so the \( \) \$ escapes reach the CSS parser intact
lhs = [i.text.strip() for i in soup.select(r'[id*=Holdings] section > .Fl\(start\) .Bdbc\(\$screenerBorderGray\) > span:nth-child(1)')]
rhs = [i.text.strip() for i in soup.select(r'[id*=Holdings] section > .Fl\(start\) .Bdbc\(\$screenerBorderGray\) span + span:last-child')]

df = pd.DataFrame(zip(lhs, rhs), columns=['Title', 'Value'])
df = df.drop([2]).reset_index(drop=True)  # drop the chart row and re-number
print(df)

References:

  1. Row re-numbering @thelatemail