
I am very new to web scraping. I am working with Selenium and want to extract the text from span tags. The tags do not have any classes or ids, and the span tags are inside li tags. I don't know how to extract the text from a span tag that is inside an li tag. Could you please help me with that?

HTML of the elements:

<div class="cmeStaticMediaBox cmeComponent section">
    <div>
        <ul class="cmeList">

            <li class="cmeListContent cmeContentGroup">
                <ul class="cmeHorizontalList cmeListSeparator"> 

                    <li>
                        <!-- Default clicked -->
                        <span>VOI By Exchange</span>
                    </li>

                    <li>
                                    
                        <a href="https://www.cmegroup.com/market-data/volume-open-interest/agriculture-commodities-volume.html" class="none" target="_self">

                        <span>Agricultural</span></a>

                    </li>
                        
                    <li>

                        <a href="https://www.cmegroup.com/market-data/volume-open-interest/energy-volume.html" class="none" target="_self">

                        <span>Energy</span></a>
                    </li>
                </ul>
            </li>
        </ul>
    </div>
</div>
  • No, I was just asking how to click on the link and get the text from the span values. I never asked for the entire flow. But the first element does not have the link and I was confused. – Mayur Atreya Aug 31 '22 at 08:24
  • Ah, OK. Can you share the Selenium code you already wrote? If possible, containing the link/url to the page you are working on too. – Prophet Aug 31 '22 at 08:36
  • Sorry for any inconvenience from my side. – Mayur Atreya Aug 31 '22 at 08:49
  • I understand... The link you shared is not a readable code. – Prophet Aug 31 '22 at 08:50

3 Answers


The simplest way to do this is

from selenium.webdriver.common.by import By

for e in driver.find_elements(By.CSS_SELECTOR, "ul.cmeHorizontalList a"):
    print(e.text)

Some pitfalls in other answers...

  1. You shouldn't use exceptions to control flow. It's just a bad practice and is slower.

  2. You shouldn't use Copy > XPath from a browser. Most of the time this generates XPaths that are very brittle. Any XPath that starts at the HTML tag, has more than a few levels, or uses a number of indices (e.g. div[2] and the like) is going to be very brittle. Even a minor change to the page will break that locator.

  3. Prefer CSS selectors over XPath. CSS selectors are better supported, faster, and the syntax is simpler.
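Pitfall 1 is easy to avoid here because `find_elements` (plural) returns an empty list, rather than raising, when nothing matches. A minimal sketch of the same approach as a reusable helper (the function name is illustrative, not from the answer; `"css selector"` is the string value behind `By.CSS_SELECTOR`):

```python
def span_texts(driver, css="ul.cmeHorizontalList a"):
    """Return the text of every element matching the CSS selector.

    find_elements() returns an empty list when nothing matches,
    so no try/except control flow is needed.
    """
    # "css selector" is the string value of By.CSS_SELECTOR, so this
    # helper works without importing selenium at module load time
    return [e.text for e in driver.find_elements("css selector", css)]
```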

JeffC

EDIT

Since you need to use Selenium, you can use XPaths to locate elements when you don't have an id or class to refer to. In your browser press F12, then right-click the element of interest and choose "Copy -> XPath". This is the proposed solution (I assume you have Chrome and the chromedriver in the same folder as the .py file):

import os
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium import webdriver

url = "https://www.cmegroup.com/market-data/volume-open-interest/metals-volume.html"

i = 1
options = webdriver.ChromeOptions()
# this flag prevents a browser window from opening; uncomment it if you
# don't need to watch the browser
# options.add_argument("--headless")
driver = webdriver.Chrome(
    options=options, executable_path=os.getcwd() + "/chromedriver.exe"
)

driver.get(url)
while True:
    xpath = f"/html/body/div[1]/div[2]/div/div[2]/div[2]/div/ul/li/ul/li[{i}]/a/span"
    try:
        res = driver.find_element(By.XPATH, xpath)
    except NoSuchElementException:
        # There are no more span elements in li
        break 
    print(res.text)
    i += 1

Results:

VOI By Exchange
Agricultural
Energy
Equities
FX
Interest Rates

You can extend this snippet to handle the .csv download from each page.
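One way to sketch that extension (this helper is an assumption about how you might structure it, not part of the original answer): collect the hrefs first and only then navigate, because calling driver.get() invalidates previously found elements and iterating over them afterwards raises a StaleElementReferenceException.

```python
def collect_links(driver, css="ul.cmeHorizontalList a"):
    """Gather all link URLs up front so later driver.get() calls
    do not invalidate the elements we are iterating over."""
    return [a.get_attribute("href")
            for a in driver.find_elements("css selector", css)]

# usage sketch:
# for url in collect_links(driver):
#     driver.get(url)
#     # ...locate and click the CSV download control on each page...
```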

OLD

If you are working with a static HTML page (like the one you provided in the question) I suggest using BeautifulSoup. Selenium is better suited when you have to click, fill forms, or otherwise interact with a web page. Here's a snippet with my solution:

from bs4 import BeautifulSoup

html_doc = """
    <div class="cmeStaticMediaBox cmeComponent section">
        <div>
            <ul class="cmeList">

                <li class="cmeListContent cmeContentGroup">
                    <ul class="cmeHorizontalList cmeListSeparator">

                        <li>
                            <!-- Default clicked -->
                            <span>VOI By Exchange</span>
                        </li>

                        <li>

                            <a href="https://www.cmegroup.com/market-data/volume-open-interest/agriculture-commodities-volume.html"
                                class="none" target="_self">

                                <span>Agricultural</span></a>

                        </li>

                        <li>

                            <a href="https://www.cmegroup.com/market-data/volume-open-interest/energy-volume.html" class="none"
                                target="_self">

                                <span>Energy</span></a>
                        </li>
                    </ul>
                </li>
            </ul>
        </div>
    </div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

for span in soup.find_all("span"):
    print(span.text)

And the result will be:

VOI By Exchange
Agricultural
Energy
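Note that on the full live page find_all("span") matches every span, not just the ones in the question's HTML. soup.select() takes a CSS selector, so the search can be scoped to the target ul. A small self-contained sketch with trimmed-down HTML:

```python
from bs4 import BeautifulSoup

html_doc = """
<div><span>Unrelated</span></div>
<ul class="cmeHorizontalList cmeListSeparator">
  <li><span>VOI By Exchange</span></li>
  <li><a href="#"><span>Agricultural</span></a></li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")
# select() takes a CSS selector, so only spans inside the target list match
texts = [s.get_text(strip=True) for s in soup.select("ul.cmeHorizontalList span")]
print(texts)  # ['VOI By Exchange', 'Agricultural']
```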
  • Thank you for the reply, but I am working with a dynamic page. I have to click the link and download the csv file from each link – Mayur Atreya Aug 31 '22 at 08:06
  • The XPath you're using is very brittle and doesn't even work for me. Any XPath that starts at the HTML tag, has more than a few levels, or uses a number of indices (e.g. `div[2]` and the like) is going to be very brittle. Any minor change to the page will break that locator. Copying an XPath from the browser is most times going to give you a very brittle locator. I would avoid using those unless you are learning how to craft your own XPaths. – JeffC Aug 31 '22 at 14:27
  • Also, the BeatifulSoup portion of your answer prints ALL SPANs on the page, not just the ones in the OP's HTML. There are currently 41 SPANs on that page and only 7 desired SPANs... so the vast majority are not the ones that OP wanted. – JeffC Aug 31 '22 at 14:36

To extract the desired texts, e.g. VOI By Exchange, Agricultural, Energy, etc., you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:

  • Using CSS_SELECTOR:

    driver.get('https://www.cmegroup.com/market-data/volume-open-interest/exchange-volume.html')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.cmeHorizontalList.cmeListSeparator li span")))])
    
  • Using XPATH:

    driver.get('https://www.cmegroup.com/market-data/volume-open-interest/exchange-volume.html')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[@id='onetrust-accept-btn-handler']"))).click()
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='cmeHorizontalList cmeListSeparator']//li//span")))])
    
  • Console Output:

    ['VOI By Exchange', 'Agricultural', 'Energy', 'Equities', 'FX', 'Interest Rates', 'Metals']
    
  • Note: you have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
undetected Selenium