
I am trying to scrape some data from Yahoo Finance. Usually I have no problem doing this; today, however, I have run into a problem trying to pull a certain container. What might be the reason this is giving me such a difficult time?

I have tried many combinations of XPaths. SelectorGadget for some reason cannot pick up the XPath. I have posted some attempts and the URL below.

The green area is what I am trying to bring into my console.

[Screenshot: the Yahoo Finance holdings page with the target area highlighted in green]

library(tidyverse)
library(rvest)
library(httr)

# XPath copied from the browser's dev tools
read_html("https://ca.finance.yahoo.com/quote/SPY/holdings?p=SPY") %>% 
  html_nodes(xpath = '//*[@id="Col1-0-Holdings-Proxy"]/section/div[1]/div[1]')

{xml_nodeset (0)}

# When I search for all tables using the following:
read_html("https://finance.yahoo.com/quote/xlk/holdings?p=xlk") %>% 
  html_nodes("table") %>% .[1] %>% html_table(fill = TRUE)

I get the table at the bottom of the page. Trying different numbers in the [] leads to errors.

What am I doing wrong? This seems like such an easy scrape. Thanks a bunch for your help.


1 Answer


Your data doesn't reside within an actual HTML table.
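
To see why a table-based approach comes up empty, here is a minimal sketch with toy markup (nothing to do with Yahoo's real classes): when the data lives in divs and spans, html_table() has nothing to parse, while a CSS selector still matches.

library(rvest)

# Toy markup only - the data sits in divs/spans, with no <table> element
doc <- read_html('<div class="row"><span>Label A</span><span>Value A</span></div>
                  <div class="row"><span>Label B</span><span>Value B</span></div>')

doc %>% html_nodes('table') %>% html_table()    # list() - no tables to parse
doc %>% html_nodes('.row span') %>% html_text() # "Label A" "Value A" "Label B" "Value B"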

You could use the following CSS selectors for now, though much of the page looks dynamically generated, and I suspect the attributes and classes will change in future. I have tried to keep the selectors a little more generic to compensate, but you should aim to make them even more generic if possible.

I use CSS selectors throughout for the flexibility and specificity they offer. The [] denotes an attribute selector, the . denotes a class selector, and *= is the contains operator, specifying that the attribute on the left-hand side has a value containing the string on the right-hand side. For example, [class*=screenerBorderGray] matches when the class attribute contains the string screenerBorderGray.
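
A quick sketch of those two selector forms on toy markup (the class names below are illustrative only):

library(rvest)

# Toy nodes with Yahoo-style atomic class names (illustrative only)
doc <- read_html('<div class="Fl(start)"><div class="Bdbc($screenerBorderGray)">match me</div></div>')

# [class*=...]: matches because the class attribute contains screenerBorderGray
doc %>% html_nodes('[class*=screenerBorderGray]') %>% html_text()

# .class: exact class token; the ( and ) must be escaped, written as \\( and \\) in R
doc %>% html_nodes('.Fl\\(start\\)')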

The " " ,">" , "+" between selectors are called combinators and are used to specify relationships between nodes matched by consecutive parts of the selector sequence.

I generate a left-column list of nodes and a right-column list of nodes (ignoring the chart column in between), then join them into a final dataframe.


R

library(rvest)
library(magrittr)

pg <- read_html('https://finance.yahoo.com/quote/xlk/holdings?p=xlk&guccounter=1')

# Left column: the label in the first span of each bordered row
lhs <- pg %>% 
  html_nodes('[id*=Holdings] section > .Fl\\(start\\) [class*=screenerBorderGray] > span:nth-child(1)') %>% 
  html_text()

# Right column: the value in the last span of each bordered row
rhs <- pg %>% 
  html_nodes('[id*=Holdings] section > .Fl\\(start\\) [class*=screenerBorderGray] span + span:last-child') %>% 
  html_text()

df <- data.frame(lhs, rhs) %>% set_names(., c('Title', 'Value'))
df <- df[-c(3), ]      # drop the unwanted chart row
rownames(df) <- NULL   # re-number the rows
print(df)

[Screenshot: the resulting dataframe printed to the console]


Py

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

r = requests.get('https://finance.yahoo.com/quote/xlk/holdings?p=xlk&guccounter=1')
soup = bs(r.content, 'lxml')

# Raw strings so the \( \) \$ escapes reach the CSS parser intact
lhs = [i.text.strip() for i in soup.select(r'[id*=Holdings] section > .Fl\(start\) .Bdbc\(\$screenerBorderGray\) > span:nth-child(1)')]
rhs = [i.text.strip() for i in soup.select(r'[id*=Holdings] section > .Fl\(start\) .Bdbc\(\$screenerBorderGray\) span + span:last-child')]

df = pd.DataFrame(zip(lhs, rhs), columns=['Title', 'Value'])
df = df.drop([2]).reset_index(drop=True)  # drop the chart row and re-number
print(df)

References:

  1. Row re-numbering @thelatemail