How to extract the time and title of each of the 7 main news within https://tengrinews.kz using Selenium and Python

Question

I need to scrape the 7 main news items from this website, tengrinews.kz: the date, time and title of each item. I use Selenium and have installed Firefox Developer Edition.

I inspected the website and the 7 news items are located in this structure:

<body>
   <header> ... some stuff </header>
   <main>
      <div class="tn-main-news-grid">

         <div class="tn-main-news-item firs-column tn-three-column tn-background-cover">  
            <span class="tn-main-news-title" style="z-index: 1;">BIG MAJOR NEWS TEXT</span>
            <a href="/kazakhstan_news/major-news/" class="tn-link"><span class="tn-hidden">BIG MAJOR NEWS TEXT</span></a>
         </div>



         <div class="tn-main-news-item"> 
            <span class="tn-main-news-title">news1 TEXT</span>
            <a href="/kazakhstan_news/news1/" class="tn-link">
            <span class="tn-hidden">news1 TEXT</span></a>
         </div>



         <div class="tn-main-news-item"> 
            <span class="tn-main-news-title">news2 TEXT</span>
            <a href="/kazakhstan_news/news2/" class="tn-link">
            <span class="tn-hidden">news2 TEXT</span></a>
         </div>



         <div class="tn-main-news-item"> 
            <span class="tn-main-news-title">news3 TEXT</span>
            <a href="/kazakhstan_news/news3/" class="tn-link">
            <span class="tn-hidden">news3 TEXT</span></a>
         </div>

      </div>
   </main>
</body>

I located the div that contains all 7 news items by XPath or by CSS selector. I do get a result back from Firefox, but it is a list and it is empty!

If I try to locate a single href or div, I get back an object of type 'list'. According to the Selenium docs the element should have a text attribute, but I get the error "no attribute text".

from selenium import webdriver
driver = webdriver.Firefox()

driver.get("https://tengrinews.kz")

css_to_big_news = 'html body div.my-app main section.tn-main-section.tn-container div.tn-main-news-container.tn-sub-container div.tn-main-news-grid div.tn-main-news-item.firs-column.tn-three-column.tn-background-cover a.tn-link'


href_big = driver.find_elements_by_css_selector(css_to_big_news)
print('type of href_big is %s and length is %d' %(type(href_big), len(href_big)))

print(href_big[0].text) #this is wrong
print(href_big.text()) # this is wrong with parenthesis

What's wrong?

undetected Selenium · Accepted Answer · 2020-07-19T19:05:58.303

To extract the text (e.g. TEXT) from each <span> using Selenium and Python, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:

Using CSS_SELECTOR:

driver.get("https://tengrinews.kz/")
wait = WebDriverWait(driver, 20)
print("Date and Time:")
times = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.tn-main-news-grid div.tn-main-news-item ul.tn-data-list>li>span time")))
print([my_elem.text for my_elem in times])
print("Title:")
titles = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.tn-main-news-grid div.tn-main-news-item span.tn-main-news-title")))
print([my_elem.get_attribute("innerHTML") for my_elem in titles])

Using XPATH:

driver.get("https://tengrinews.kz/")
wait = WebDriverWait(driver, 20)
print("Date and Time:")
times = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='tn-main-news-grid ']//div[contains(@class, 'tn-main-news-item')]//ul[@class='tn-data-list']/li/span//time")))
print([my_elem.text for my_elem in times])
print("Title:")
titles = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='tn-main-news-grid ']//div[contains(@class, 'tn-main-news-item')]//span[@class='tn-main-news-title']")))
print([my_elem.text for my_elem in titles])

Console Output:

Date and Time:
['вчера, 18:27', 'вчера, 21:45', 'вчера, 20:52', 'вчера, 19:48', 'вчера, 17:34', 'вчера, 14:50', 'вчера, 14:32']
Title:
['Жара до 42 градусов ожидается в регионах Казахстана', 'Строгий карантин вводят в Мангистауской области', 'Нехватку вакцин и новую "суровую" волну COVID-19 предрекли в мире', 'Столицу Казахстана "оживили"', 'Жители Актау собрались на площади из-за отсутствия лекарств в аптеках', 'Строгий карантин в Нур-Султане продлили до 2 августа', '"Едят антибиотики". Врач из Павлодара объяснил рост числа тяжелых больных']
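
The two lists above are separate, so to get the time and title of each news item together they can be paired by position. The following is only a minimal sketch under stated assumptions: the helper names `parse_stamp` and `pair_news` are hypothetical, it assumes both lists come back in the same order and with the same length, and it assumes the site only uses the relative day words "сегодня" (today) and "вчера" (yesterday). Any other format is returned as `None` rather than guessed.

```python
from datetime import date, datetime, time, timedelta

# Assumption: only these two relative day words appear in the timestamps.
# "сегодня" means today, "вчера" means yesterday.
DAY_OFFSETS = {"сегодня": 0, "вчера": -1}


def parse_stamp(stamp, today):
    """Turn a string such as 'вчера, 18:27' into a datetime, or None if unrecognised."""
    day_word, _, clock = stamp.partition(",")
    offset = DAY_OFFSETS.get(day_word.strip().lower())
    if offset is None:
        return None  # unknown day word, e.g. an absolute date
    try:
        hour, minute = (int(part) for part in clock.strip().split(":"))
        clock_time = time(hour, minute)
    except ValueError:
        return None  # the clock part was not in HH:MM form
    return datetime.combine(today + timedelta(days=offset), clock_time)


def pair_news(stamps, titles, today):
    """Pair each timestamp with the title at the same position."""
    return [(parse_stamp(stamp, today), title) for stamp, title in zip(stamps, titles)]


# The first two entries of the console output above; the reference date is the
# day this answer was posted and is passed in only for illustration.
stamps = ["вчера, 18:27", "вчера, 21:45"]
titles = [
    "Жара до 42 градусов ожидается в регионах Казахстана",
    "Строгий карантин вводят в Мангистауской области",
]
for when, title in pair_news(stamps, titles, date(2020, 7, 19)):
    print(when, title)
```

Note that `zip()` silently stops at the shorter list, so if a news item has no timestamp the pairs after it would be shifted; comparing `len(stamps)` and `len(titles)` first is a cheap check.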

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
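
As for the "no attribute text" error in the question: the find_elements family of methods returns a plain Python list, which is empty when nothing matched. The list itself has no text attribute, and indexing an empty list with [0] raises IndexError. A small guard can make this explicit. This is a sketch only; `first_text` is a hypothetical helper name, not part of Selenium.

```python
def first_text(elements, default=""):
    """Return .text of the first element of a find_elements() result.

    `elements` is the list returned by a find_elements call. The list has no
    .text attribute of its own, and [0] fails on an empty list, so both cases
    are handled here by falling back to `default`.
    """
    return elements[0].text if elements else default
```

With the variables from the question this would be called as `first_text(href_big)`, which yields an empty string instead of an exception when the selector matched nothing.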

Outro

Link to useful documentation:

get_attribute() method: gets the given attribute or property of the element.
text attribute: returns the text of the element.
Difference between text and innerHTML using Selenium
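
To make the difference between text and innerHTML concrete: text gives the text of the element as rendered, so it comes back empty for an element that is not displayed, while get_attribute("innerHTML") returns the markup inside the element regardless of visibility. The sketch below falls back from one to the other. The helper name `text_or_inner_html` is hypothetical, and it is an assumption that the tn-hidden spans inside a.tn-link in the question are hidden by CSS.

```python
def text_or_inner_html(element):
    """Return the visible text of a WebElement, or its innerHTML if none is visible."""
    visible = element.text.strip()
    if visible:
        return visible
    # innerHTML is the raw markup inside the element, so the result can
    # contain tags when the element has child elements.
    return (element.get_attribute("innerHTML") or "").strip()


# Possible usage with the driver and By import from the answer above:
# titles = [text_or_inner_html(e)
#           for e in driver.find_elements(By.CSS_SELECTOR, "span.tn-main-news-title")]
```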

You even added the time, I was looking for it. Thank you. – ERJAN, Jul 19 '20 at 18:57
@ERJAN Added `CSS_SELECTOR` for your convenience. – undetected Selenium, Jul 19 '20 at 19:06
