
I want to scrape articles on the news website Al Jazeera. I wrote relative XPaths that lead me to the sentences in the browser dev tools. But bizarrely, scraping the text with the exact same XPaths failed. For example, take this article (url: https://www.aljazeera.com/economy/2023/2/6/who-is-gautam-adani-and-why-is-he-controversial)

xpaths:

//header[@class="article-header"]/h1
//header[@class="article-header"]//em
//main[@id="main-content-area"]/div[2]/p[1]
//main[@id="main-content-area"]/div[2]/p[2]
//main[@id="main-content-area"]/div[2]/p[3]
//main[@id="main-content-area"]/div[2]/p[4]

... etc., but nothing got scraped.

I tested both

.text
.get_attribute('textContent')

Both failed, and it is not because the text is invisible.

Please help me to scrape the paragraphs.


4 Answers


All of your locators are correct. To print the texts from the website, ideally you need to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies:

  • Code block:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
    options = Options()
    options.add_argument("start-maximized")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    options.add_experimental_option('useAutomationExtension', False)
    options.add_argument('--disable-blink-features=AutomationControlled')
    s = Service('C:\\BrowserDrivers\\chromedriver.exe')
    driver = webdriver.Chrome(service=s, options=options)
    driver.get('https://www.aljazeera.com/economy/2023/2/6/who-is-gautam-adani-and-why-is-he-controversial')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//header[@class='article-header']/h1"))).text)
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//header[@class='article-header']//em"))).text)
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//main[@id='main-content-area']/div[2]/p[1]"))).text)
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//main[@id='main-content-area']/div[2]/p[2]"))).text)
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//main[@id='main-content-area']/div[2]/p[3]"))).text)
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//main[@id='main-content-area']/div[2]/p[4]"))).text)
    
  • Console Output:

    Who is Gautam Adani and why is he controversial?
    The Indian entrepreneur has seen his wealth plummet after a research firm accused him of ‘brazen stock manipulation’.
    Allegations of stock market manipulation and fraud have halved the net worth of Indian tycoon Gautam Adani, one of the wealthiest people in the world, in less than two weeks and wiped more than $110bn from his listed firms in India.
    With investor confidence shaken, legislators have demanded an investigation into his businesses. Here’s a look at who Adani is, what concerns have been raised and what has happened since.
    Who is Gautam Adani?
    He is the founder and chairman of the Adani Group, one of the largest business conglomerates in India. A native of Gujarat — the same state where India’s Prime Minister Narendra Modi is from — Adani, 60, is a college dropout. He walked away from his father’s textile shop to set up a commodities trading business in 1988, his entry into the world of business.
    
undetected Selenium

I hope this works for your case. Kindly add the options which I define in the code:

from webdriver_manager.chrome import ChromeDriverManager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
options = Options()
# options.add_argument('--disable-blink-features=AutomationControlled')
service = ChromeService(executable_path=ChromeDriverManager().install())
options.add_experimental_option('excludeSwitches', ['enable-logging']) # KINDLY ADD THIS OPTION
driver = webdriver.Chrome(service=service, options=options)
URL = 'https://www.aljazeera.com/economy/2023/2/6/who-is-gautam-adani-and-why-is-he-controversial'
driver.get(URL)
# Define your code here
# //header[@class="article-header"]/h1
# //header[@class="article-header"]//em
# //main[@id="main-content-area"]/div[2]/p[1]
# //main[@id="main-content-area"]/div[2]/p[2]
# //main[@id="main-content-area"]/div[2]/p[3]
# //main[@id="main-content-area"]/div[2]/p[4]
h1_tag = driver.find_elements(By.XPATH, '//header[@class="article-header"]/h1')[0]
print(f'h1: {h1_tag.text}')
em_tag = driver.find_elements(By.XPATH, '//header[@class="article-header"]//em')[0]
print(f'em: {em_tag.text}')
for i in range(1, 5):
    p_tag = driver.find_elements(By.XPATH, f'//main[@id="main-content-area"]/div[2]/p[{i}]')[0]
    print(f'p{i}: {p_tag.text}')
driver.quit()
Muhammad Ali

Try using the full XPath:

driver.find_element("xpath", "/html/body/div[1]/div/div[3]/div/div/div/div[1]/main/div[2]/p[1]")
Filip

I rewrote the code and it works. The reason it didn't work before is that I was trying to drop the code below into another, larger script, and something probably went wrong during the merging.

It is hard to combine different def(s) together. Thanks for the answers provided.

The code below works:

# import library
import os
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# default parameters
desktop_path = os.path.join(os.path.join(os.environ['USERPROFILE']), 'Desktop')
edge_driver_path = desktop_path + r"\msedgedriver.exe"

# page url
url = "https://www.aljazeera.com/economy/2023/2/6/who-is-gautam-adani-and-why-is-he-controversial"

# xpath
new_title = "//header[@class='article-header']/h1"
new_brief = "//header[@class='article-header']//em"
new_par01 = "//main[@id='main-content-area']/div[2]/p[1]"
new_par02 = "//main[@id='main-content-area']/div[2]/p[2]"
new_par03 = "//main[@id='main-content-area']/div[2]/p[3]"
new_par04 = "//main[@id='main-content-area']/div[2]/p[4]"
new_par05 = "//main[@id='main-content-area']/div[2]/p[5]"
new_par06 = "//main[@id='main-content-area']/div[2]/p[6]"
new_par07 = "//main[@id='main-content-area']/div[2]/p[7]"
new_par08 = "//main[@id='main-content-area']/div[2]/p[8]"
new_par09 = "//main[@id='main-content-area']/div[2]/p[9]"
new_par10 = "//main[@id='main-content-area']/div[2]/p[10]"
xpath_list = [new_title, new_brief,
              new_par01, new_par02, new_par03, new_par04, new_par05,
              new_par06, new_par07, new_par08, new_par09, new_par10]

def paragraph_scraping(url, xpath_list):
    # the Edge driver
    s = Service(edge_driver_path)
    driver = webdriver.Edge(service=s)

    # open url
    driver.get(url)

    # manipulate browser windows to load information on page
    driver.set_window_size(1024, 600)
    driver.maximize_window()
    driver.execute_script("window.scrollTo(0, 1000)")
    time.sleep(0.5)
    driver.execute_script("window.scrollTo(0, 500)")
    time.sleep(0.5)
    driver.execute_script("window.scrollTo(0, 300)")
    time.sleep(0.5)
    driver.execute_script("window.scrollTo(0, 100)")
    time.sleep(1)

    # create paragraph container
    news_sentences = []
    for xpath in xpath_list:
        try:
            a = WebDriverWait(driver, 0.5)
            # title extract
            b = a.until(EC.presence_of_element_located((By.XPATH, xpath)))
            c = b.get_attribute('textContent')
            news_sentences.append(c)
        except Exception:
            # this paragraph index may not exist on shorter articles; skip it
            pass

    # join sentences
    news_paragraph = "\n".join(news_sentences)

    return news_paragraph

print(paragraph_scraping(url, xpath_list))
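The ten hand-numbered `new_parNN` variables can also be generated in a loop; a sketch that builds the same `xpath_list` (the helper name `build_xpath_list` is mine):

```python
def build_xpath_list(n_paragraphs=10):
    # Title and sub-heading locators first, then one entry per paragraph index.
    xpaths = [
        "//header[@class='article-header']/h1",
        "//header[@class='article-header']//em",
    ]
    xpaths += [
        f"//main[@id='main-content-area']/div[2]/p[{i}]"
        for i in range(1, n_paragraphs + 1)
    ]
    return xpaths
```

Then `paragraph_scraping(url, build_xpath_list())` gives the same result, and the paragraph count becomes a parameter instead of a copy-paste block.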