Can't extract data from site

Question

I need to extract data from the page: https://acsjournals.onlinelibrary.wiley.com/doi/full/10.3322/caac.21774

My code is:

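import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def fetch_current_article_data(url_article):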
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)

    try:
        driver.get(url_article)
        time.sleep(5)  # Allow time for page content to load

        article_title_element = WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="article__content"]/div[2]/div/h1'))
        )
        article_title = article_title_element.text.strip()
        print("Article Title:", article_title)


    except Exception as e:
        print("An error occurred:", e)

    finally:
        driver.quit()

fetch_current_article_data("https://acsjournals.onlinelibrary.wiley.com/doi/full/10.3322/caac.21774")

I tried different selectors, but it looks like the request did not return the page content. What is my error? Bard and ChatGPT didn't help.

_What is my error?_ You haven't shared the output/errors from the code, so we have no idea what your error is... — John Gordon, Aug 17 '23 at 03:32

score 1 · Answer 1 · answered Aug 17 '23 at 04:21

With the "--headless" browser option, the browser did not reach the actual page; it was served a Cloudflare security check instead.

See the related question "Selenium headless: How to bypass Cloudflare detection using Selenium" for more information. I removed the headless option, changed the XPath, and was able to retrieve the article title.

from selenium import webdriver
import chromedriver_binary  # Adds the bundled chromedriver binary to PATH
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def fetch_current_article_data(url_article):
    options = webdriver.ChromeOptions()
    # Headless mode is left disabled: with it enabled, the site served a Cloudflare check.
    # options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)

    try:
        print("Getting URL :", url_article)
        driver.get(url_article)
        time.sleep(5)  # Allow time for page content to load
        # print("Page Source => ", driver.page_source)
        # print('Checking for X path : //*[@id="article__content"]/div[2]/div/h1')
        article_title_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="article__content"]/div[2]/div/h1'))
        )
        article_title = article_title_element.text.strip()
        print("Article Title:", article_title)


    except Exception as e:
        print("An error occurred:", e)

    finally:
        driver.quit()

fetch_current_article_data("https://acsjournals.onlinelibrary.wiley.com/doi/full/10.3322/caac.21774")

Output:

(py_env) E:\>python test_python.py

DevTools listening on ws://127.0.0.1:57028/devtools/browser/fe517a61-04e6-4907-9c85-15a16181dc06
Getting URL : https://acsjournals.onlinelibrary.wiley.com/doi/full/10.3322/caac.21774
Article Title: Acquiring tissue for advanced lung cancer diagnosis and comprehensive biomarker testing: A National Lung Cancer Roundtable best-practice guide
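
To check whether a given run was served the security check rather than the article, it can help to look at what the browser actually loaded before waiting for the title element. The following is only a rough sketch: the helper names (looks_like_challenge, report_loaded_page) are hypothetical, and the marker strings are my assumption about what such an interstitial page typically shows in its title, not an official list, so it may miss some cases.

```python
# Sketch only: these marker strings are assumptions about typical
# security-check page titles, not an official or complete list.
CHALLENGE_MARKERS = (
    "just a moment",
    "checking your browser",
    "attention required",
)


def looks_like_challenge(title):
    """Return True if the page title contains one of the assumed markers."""
    lowered = title.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def report_loaded_page(driver):
    """Print what was actually loaded; `driver` is a Selenium WebDriver."""
    title = driver.title
    blocked = looks_like_challenge(title)
    print("Current URL:", driver.current_url)
    print("Page title:", title)
    print("Looks like a security check:", blocked)
    return blocked
```

Calling report_loaded_page(driver) right after driver.get(url_article) and the sleep, once with "--headless" and once without, should show whether the two runs land on different pages.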
