
In the code below I'm using the XPath //*[@id="economic-calendar-events"]/table/tbody/tr, which I took directly from inspecting a table-row element at https://www.fxstreet.com/economic-calendar. However, I'm getting an empty result from

soup.find_all('//*[@id="economic-calendar-events"]/table/tbody/tr') 

Any idea why?

import time

import undetected_chromedriver as uc
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

URL = "https://www.fxstreet.com/economic-calendar"

if __name__ == "__main__":
    options = webdriver.ChromeOptions()
    # options.add_argument("--headless")
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.5249.119 Safari/537.36'
    options.add_argument(f'user-agent={user_agent}')
    driver = uc.Chrome(
        options=options,
        service=ChromeService(ChromeDriverManager().install()))
    driver.get(URL)
    time.sleep(7)

    page_source = driver.page_source
    print(page_source)
    soup = BeautifulSoup(page_source, 'html.parser')
    tr_elements = soup.find_all('//*[@id="economic-calendar-events"]/table/tbody/tr')
    print('tr_elements size=' + str(len(tr_elements)))

2 Answers


Concerning BeautifulSoup and XPath, check the following post: Can we use XPath with BeautifulSoup? In short, BeautifulSoup does not support XPath expressions, which is why find_all() with an XPath string returns nothing.


Best practice would be to use the API, as described by @Barry the Platipus: https://stackoverflow.com/a/76826104/14460824

If you'd like to stick with your approach, you could use your XPath with Selenium directly and iterate over the results:

rows = driver.find_elements(By.XPATH, '//*[@id="economic-calendar-events"]/table/tbody/tr')

or use BeautifulSoup with CSS selectors:

soup.select('#economic-calendar-events table tbody tr')

or, as mentioned in the post linked above, use lxml, which does support XPath.
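The difference between these approaches can be demonstrated on a minimal stand-in for the calendar markup (the HTML below is a made-up sketch, not the real FXStreet page):

```python
from bs4 import BeautifulSoup
from lxml import html

# Hypothetical miniature of the page structure:
doc = """
<div id="economic-calendar-events">
  <table><tbody>
    <tr><td>Event A</td></tr>
    <tr><td>Event B</td></tr>
  </tbody></table>
</div>
"""

soup = BeautifulSoup(doc, 'html.parser')

# find_all() expects a tag name (or filter), not an XPath string,
# so an XPath expression silently matches no tags:
print(len(soup.find_all('//*[@id="economic-calendar-events"]/table/tbody/tr')))  # 0

# The equivalent CSS selector works:
print(len(soup.select('#economic-calendar-events table tbody tr')))  # 2

# lxml evaluates the original XPath directly:
tree = html.fromstring(doc)
print(len(tree.xpath('//*[@id="economic-calendar-events"]/table/tbody/tr')))  # 2
```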


You're getting an empty result because that data is retrieved from an API endpoint by an XHR call after the page loads. Here is one way of getting that information by scraping that endpoint directly (you can find it under the Dev tools -> Network tab), without the overhead of Selenium:

import requests
import pandas as pd

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

headers = {
    'Accept': 'application/json',
    'Origin': 'https://www.fxstreet.com',
    'Referer': 'https://www.fxstreet.com/',
    'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'
}

url = 'https://calendar-api.fxstreet.com/en/api/v1/eventDates/2023-08-03T06:02:55Z/2023-08-05T08:02:55Z?=&volatilities=NONE&volatilities=LOW&volatilities=MEDIUM&volatilities=HIGH&countries=US&countries=UK&countries=EMU&countries=DE&countries=CN&countries=JP&countries=CA&countries=AU&countries=NZ&countries=CH&countries=FR&countries=IT&countries=ES&countries=UA&categories=8896AA26-A50C-4F8B-AA11-8B3FCCDA1DFD&categories=FA6570F6-E494-4563-A363-00D0F2ABEC37&categories=C94405B5-5F85-4397-AB11-002A481C4B92&categories=E229C890-80FC-40F3-B6F4-B658F3A02635&categories=24127F3B-EDCE-4DC4-AFDF-0B3BD8A964BE&categories=DD332FD3-6996-41BE-8C41-33F277074FA7&categories=7DFAEF86-C3FE-4E76-9421-8958CC2F9A0D&categories=1E06A304-FAC6-440C-9CED-9225A6277A55&categories=33303F5E-1E3C-4016-AB2D-AC87E98F57CA&categories=9C4A731A-D993-4D55-89F3-DC707CC1D596&categories=91DA97BD-D94A-4CE8-A02B-B96EE2944E4C&categories=E9E957EC-2927-4A77-AE0C-F5E4B5807C16'
r = requests.get(url, headers=headers)
df = pd.json_normalize(r.json())
df = df[[x for x in df.columns if x not in ['id', 'eventId']]]
print(df)

Result:

    dateUtc     periodDateUtc   periodType  actual  revised     consensus   ratioDeviation  previous    isBetterThanExpected    name    countryCode     currencyCode    unit    potency     volatility  isAllDay    isTentative     isPreliminary   isReport    isSpeech    lastUpdated     previousIsPreliminary
0   2023-08-03T06:30:00Z    2023-07-01T00:00:00Z    MONTH   -0.10   NaN     -0.2    0.44543     0.100   True    Consumer Price Index (MoM)  CH  CHF     %   ZERO    MEDIUM  False   False   False   False   False   1691044615  None
1   2023-08-03T06:30:00Z    2023-07-01T00:00:00Z    MONTH   1.60    NaN     1.6     0.00000     1.700   None    Consumer Price Index (YoY)  CH  CHF     %   ZERO    HIGH    False   False   False   False   False   1691044574  None
2   2023-08-03T06:45:00Z    2023-06-01T00:00:00Z    MONTH   -116.18     NaN     NaN     NaN     -107.222    None    Budget Balance  FR  EUR     €   B   LOW     False   False   False   False   False   1691045513  None
3   2023-08-03T07:15:00Z    2023-07-01T00:00:00Z    MONTH   52.80   NaN     53.4    -0.29185    53.400  False   HCOB Services PMI   ES  EUR     None    ZERO    MEDIUM  False   False   False   False   False   1691047307  None
4   2023-08-03T07:45:00Z    2023-07-01T00:00:00Z    MONTH   51.50   NaN     52.2    -0.53725    52.200  False   HCOB Services PMI   IT  EUR     None    ZERO    MEDIUM  False   False   False   False   False   1691049115  None
...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
72  2023-08-04T20:30:00Z    None    None    NaN     NaN     NaN     NaN     -232.600    None    CFTC S&P 500 NC Net Positions   US  USD     $   K   LOW     False   False   False   False   False   1690578149  None
73  2023-08-04T20:30:00Z    None    None    NaN     NaN     NaN     NaN     59.000  None    CFTC GBP NC Net Positions   UK  GBP     £   K   LOW     False   False   False   False   False   1690577994  None
74  2023-08-04T20:30:00Z    None    None    NaN     NaN     NaN     NaN     -77.800     None    CFTC JPY NC Net Positions   JP  JPY     ¥   K   LOW     False   False   False   False   False   1690578149  None
75  2023-08-04T20:30:00Z    None    None    NaN     NaN     NaN     NaN     177.200     None    CFTC EUR NC Net Positions   EMU     EUR     €   K   LOW     False   False   False   False   False   1690578105  None
76  2023-08-04T20:30:00Z    None    None    NaN     NaN     NaN     NaN     -51.200     None    CFTC AUD NC Net Positions   AU  AUD     $   K   LOW     False   False   False   False   False   1690578217  None

77 rows × 22 columns
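The flattening done by pd.json_normalize, plus the column filtering from the snippet above, can be sketched on a tiny made-up payload (the field names mirror the real response; the values are invented for illustration):

```python
import pandas as pd

# Hypothetical miniature of the calendar-api JSON response:
payload = [
    {"id": "a1", "eventId": "e1", "dateUtc": "2023-08-03T06:30:00Z",
     "name": "Consumer Price Index (MoM)", "actual": -0.10, "consensus": -0.2},
    {"id": "a2", "eventId": "e2", "dateUtc": "2023-08-03T06:30:00Z",
     "name": "Consumer Price Index (YoY)", "actual": 1.60, "consensus": 1.6},
]

df = pd.json_normalize(payload)

# Drop the identifier columns, exactly as in the answer:
df = df[[c for c in df.columns if c not in ('id', 'eventId')]]

print(df.columns.tolist())  # ['dateUtc', 'name', 'actual', 'consensus']
print(len(df))              # 2
```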
  • Somehow when I print driver.page_source in my code before constructing the BeautifulSoup object with it, it prints the tables with the values. I thought it means it is getting the late loaded page contents too. – Walking Corpse Aug 03 '23 at 09:08
  • It could be that the API-derived info is (sometimes) loaded before BeautifulSoup parses the page. In any case, Selenium is not needed here. If for various reasons you insist on using Selenium, you don't need BeautifulSoup. – Barry the Platipus Aug 03 '23 at 09:19