
I'm trying to scrape a table with the following code:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import pandas as pd

options = webdriver.FirefoxOptions()
options.binary_location = r'C://Mozilla Firefox/firefox.exe'
driver = webdriver.Firefox(executable_path='C://geckodriver.exe', options=options)


url = 'https:/'
driver.get(url)

table = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, 'dgtopHolders')))
dfs = pd.read_html(table.get_attribute('outerHTML'))
print(dfs[0])

How can I scrape this table? Thanks for your time.

Best

  • You could try saving the page to HTML and parsing it? Or is the issue that you have to go through the rows 50 at a time? There are only 52 rows in your specific example so this shouldn't be an issue? – Barry Carter Aug 04 '22 at 11:51
  • I am seeking a solution. Everything is ok. The right solution is the unique request – Agustos Imola Aug 04 '22 at 11:54
  • The number of rows is not a problem. I prefer 50 rows, but it's not a problem – Agustos Imola Aug 04 '22 at 11:55
  • I don't find the [table](https://i.stack.imgur.com/Oww8E.png) either in [this](https://whalewisdom.com/filer/berkshire-hathaway-inc#tabholdings_tab_link) or [this](https://whalewisdom.com/filer/berkshire-hathaway-inc#tabholdings_tab_linksta) link, am I missing something obvious? – undetected Selenium Aug 04 '22 at 12:23
  • @undetectedSelenium You have to click a button which runs some Javascript. There's no direct link to the page (unless you cheat as per the answer(s) below). – Barry Carter Aug 04 '22 at 12:24
  • @BarryCarter Whoa, I doubt there is any way you can cheat with Selenium, as Selenium strives for visibility of any element at the least. – undetected Selenium Aug 04 '22 at 12:26
  • @undetectedSelenium I was answering your question as to why the table wasn't in either link. There is no direct link to the table, at least not an obvious one. You can "cheat" by looking at what URLs the page itself calls when you click on Whitelist, but that's different – Barry Carter Aug 04 '22 at 12:27
  • @BarryCarter Your tip helped me to construct an answer. Thanks again for the tip. – undetected Selenium Aug 04 '22 at 12:46
  • Yes! There is the table! https://whalewisdom.com/filer/berkshire-hathaway-inc#tabholdings_tab_link (just scroll the page) – Agustos Imola Aug 04 '22 at 13:25

3 Answers


If you analyse the Network tab while that page loads, you will notice an API being accessed via an XHR call, pulling this data into the page. A more elegant way of obtaining that data - all 52 rows - would be:

import requests
import pandas as pd

headers = {
    'Content-Type': 'application/json'
}
# XHR endpoint observed in the browser's Network tab
url = 'https://whalewisdom.com/filer/holdings?id=berkshire-hathaway-inc&q1=-1&type_filter=1,2,3,4&symbol=&change_filter=&minimum_ranking=&minimum_shares=&is_etf=0&sc=true&sort=current_mv&order=desc&offset=0&limit=99'
r = requests.get(url, headers=headers)
print(r.json())
df = pd.DataFrame(r.json()['rows'])
print(df[:10])

This would return:

symbol permalink security_type name sector industry current_shares previous_shares shares_change position_change_type percent_shares_change current_ranking previous_ranking current_percent_of_portfolio previous_percent_of_portfolio current_mv previous_mv stock_id percent_ownership quarter_first_owned quarter_id_owned source_type source_date filing_date avg_price recent_price quarter_end_price id
0 AAPL aapl SH Apple Inc INFORMATION TECHNOLOGY COMPUTERS & PERIPHERALS 8.90923e+08 8.87136e+08 3.78786e+06 addition 0.427 1 1 42.2022 47.5985 1.55564e+11 1.57529e+11 195 5.50456 Q1 2016 61 13F 2022-03-31 2022-05-16 36.6604 166.13 174.61
1 BAC bac SH Bank of America Corp. (North Carolina National Bank) FINANCE BANKS 1.0101e+09 1.0101e+09 0 0 2 2 11.2953 13.5788 4.16363e+10 4.49394e+10 205 12.5371 Q3 2017 67 13F 2022-03-31 2022-05-16 25.5185 33.64 41.22
2 AXP axp SH American Express Co FINANCE CONSUMER FINANCE 1.51611e+08 1.51611e+08 0 0 3 3 7.69125 7.4946 2.83512e+10 2.48035e+10 368 20.1326 Q1 2001 1 13F 2022-03-31 2022-05-16 39.311 155.43 187
3 CVX cvx SH Chevron Corp. (Standard Oil of California) ENERGY INTEGRATED OIL & GAS 1.59178e+08 3.8245e+07 1.20933e+08 addition 316.206 4 9 7.03142 1.3561 2.5919e+10 4.48806e+09 214 8.10144 Q4 2020 80 13F 2022-03-31 2022-05-16 125.342 155.36 162.83
4 KO ko SH Coca Cola Co. CONSUMER STAPLES BEVERAGES 4e+08 4e+08 0 0 5 4 6.72786 7.1563 2.48e+10 2.3684e+10 386 9.22716 Q1 2001 1 13F 2022-03-31 2022-05-16 27.1275 63.92 62
5 OXY oxy SH Occidental Petroleum Corp. ENERGY INTEGRATED OIL & GAS 2.26119e+08 2.20232e+08 5.88762e+06 addition 2.6734 6 999999 3.57628 nan 1.31828e+10 1.21326e+10 442 24.1274 Q1 2022 85 4 2022-05-02 2022-05-04 nan 60.99 nan
6 KHC khc SH Kraft Heinz Co. (The) CONSUMER STAPLES FOOD PRODUCTS 3.25635e+08 3.25635e+08 0 0 7 5 3.4797 3.5323 1.28268e+10 1.16903e+10 178038 26.6052 Q3 2015 59 13F 2022-03-31 2022-05-16 75.4858 37.34 39.39
7 MCO mco SH Moodys Corp FINANCE GENERAL FINANCE 2.46698e+07 2.46698e+07 0 0 8 6 2.25813 2.9114 8.32383e+09 9.63552e+09 2707 13.3712 Q1 2001 1 13F 2022-03-31 2022-05-16 13.7106 309.92 337.41
8 USB usb SH U.S. Bancorp (First National Bank of Cincinnati) FINANCE BANKS 1.26418e+08 1.26418e+08 0 0 9 8 1.82279 2.1456 6.71911e+09 7.10089e+09 471 8.50875 Q1 2006 21 13F 2022-03-31 2022-05-16 40.0713 47.66 53.15
9 ATVI atvi SH Activision Blizzard Inc INFORMATION TECHNOLOGY SOFTWARE 6.43152e+07 1.46581e+07 4.96571e+07 addition 338.769 10 24 1.39774 0.2947 5.15229e+09 9.75205e+08 11336 8.2257 Q4 2021 84 13F 2022-03-31 2022-05-16 72.9554 80.59 80.11
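
The query string already exposes offset and limit parameters, so the same endpoint can also be paged through in chunks of 50 if you prefer smaller requests. Here is a minimal sketch, assuming the endpoint honours those two parameters the same way the page's own pagination does:

import requests
import pandas as pd

headers = {'Content-Type': 'application/json'}
base = ('https://whalewisdom.com/filer/holdings?id=berkshire-hathaway-inc'
        '&q1=-1&type_filter=1,2,3,4&symbol=&change_filter=&minimum_ranking='
        '&minimum_shares=&is_etf=0&sc=true&sort=current_mv&order=desc')

frames = []
offset = 0
while True:
    # request one page of 50 rows at a time (assumes the API honours offset/limit)
    r = requests.get(f'{base}&offset={offset}&limit=50', headers=headers)
    rows = r.json().get('rows', [])
    if not rows:
        break
    frames.append(pd.DataFrame(rows))
    offset += 50

df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(len(df))
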
Barry the Platipus

This is working for me. You can access each element using 'item'.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

try:
    # wait up to 4 seconds for this element; just log the exception if it never appears
    WebDriverWait(driver, 4).until(EC.presence_of_element_located((By.XPATH, "/html/body/div[11]/div/div/a/svg")))
except Exception as e:
    print(e)

# scroll to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

list_of_tr = WebDriverWait(driver, 4).until(EC.presence_of_all_elements_located((By.XPATH, "/html/body/div[4]/div[3]/div[1]/fieldset/div[2]/div[1]/div[3]/div[2]/table/tbody/tr")))
for item in list_of_tr:
    stock = item.find_element(By.XPATH, './td[1]/a').text.strip()
    print(stock)
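
If you want the scraped rows in a DataFrame instead of printing them one by one, here is a minimal sketch building on the list_of_tr from the block above; it simply collects the text of every <td> in each row (which columns you keep is up to you):

import pandas as pd

records = []
for item in list_of_tr:
    # collect the raw text of every cell in the row
    cells = item.find_elements(By.XPATH, './td')
    records.append([cell.text.strip() for cell in cells])

df = pd.DataFrame(records)
print(df.head())
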
Orloff

To scrape the data from the current_holdings_table: as the <table> is present in the HTML DOM but not visible within the webpage, you need to induce WebDriverWait for the presence_of_element_located() of the <table> element, extract its outerHTML, and read that outerHTML using read_html(). You can use the following locator strategy:

  • Code Block:

    driver.execute("get", {'url': 'https://whalewisdom.com/filer/berkshire-hathaway-inc#tabholdings_tab_link'})
    # data = driver.find_element(By.CSS_SELECTOR, "table#current_holdings_table").get_attribute("outerHTML")
    data = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, "table#current_holdings_table"))).get_attribute("outerHTML")
    df  = pd.read_html(data)
    print(df)
    
  • Console Output:

    [    Stock  Unnamed: 1                  Sector  ...  Source Source Date  Date Reported
    0    AAPL         NaN  INFORMATION TECHNOLOGY  ...     13F  2022-03-31     2022-05-16
    1     BAC         NaN                 FINANCE  ...     13F  2022-03-31     2022-05-16
    2     AXP         NaN                 FINANCE  ...     13F  2022-03-31     2022-05-16
    3     CVX         NaN                  ENERGY  ...     13F  2022-03-31     2022-05-16
    4      KO         NaN        CONSUMER STAPLES  ...     13F  2022-03-31     2022-05-16
    5     OXY         NaN                  ENERGY  ...       4  2022-05-02     2022-05-04
    6     KHC         NaN        CONSUMER STAPLES  ...     13F  2022-03-31     2022-05-16
    7     MCO         NaN                 FINANCE  ...     13F  2022-03-31     2022-05-16
    8     USB         NaN                 FINANCE  ...     13F  2022-03-31     2022-05-16
    9    ATVI         NaN  INFORMATION TECHNOLOGY  ...     13F  2022-03-31     2022-05-16
    10    HPQ         NaN  INFORMATION TECHNOLOGY  ...     13G  2022-04-30     2022-04-30
    11     BK         NaN                 FINANCE  ...     13F  2022-03-31     2022-05-16
    12     KR         NaN        CONSUMER STAPLES  ...     13F  2022-03-31     2022-05-16
    13    DVA         NaN             HEALTH CARE  ...     13D  2022-08-01     2022-08-01
    14      C         NaN                 FINANCE  ...     13F  2022-03-31     2022-05-16
    15   VRSN         NaN          COMMUNICATIONS  ...     13F  2022-03-31     2022-05-16
    16     GM         NaN  CONSUMER DISCRETIONARY  ...     13F  2022-03-31     2022-05-16
    17   PARA         NaN          COMMUNICATIONS  ...     13F  2022-03-31     2022-05-16
    18   CHTR         NaN          COMMUNICATIONS  ...     13F  2022-03-31     2022-05-16
    19  LSXMK         NaN          COMMUNICATIONS  ...     13F  2022-03-31     2022-05-16
    20      V         NaN                 FINANCE  ...     13F  2022-03-31     2022-05-16
    21   AMZN         NaN  CONSUMER DISCRETIONARY  ...     13F  2022-03-31     2022-05-16
    22    AON         NaN                 FINANCE  ...     13F  2022-03-31     2022-05-16
    23     MA         NaN                 FINANCE  ...     13F  2022-03-31     2022-05-16
    24   SNOW         NaN  INFORMATION TECHNOLOGY  ...     13F  2022-03-31     2022-05-16
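
  • Follow-up (optional): read_html() returns a list of DataFrames, so a short clean-up sketch, assuming the column names shown in the console output above, could be:

    holdings = df[0]  # read_html() returns a list; take the first (and only) table here
    holdings = holdings.drop(columns=['Unnamed: 1'], errors='ignore')  # drop the empty spacer column
    print(holdings[['Stock', 'Sector', 'Source', 'Date Reported']].head())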
    
undetected Selenium