How to scrape table from webpage with Selenium and Headless Chrome

Question

I was trying to write a script with Python to export the Product Attributes table as an Excel file (or CSV) from the URL below.

I wrote a script and tried several class names, but I keep getting an error.

The URL: https://www.digikey.com/en/products/detail/texas-instruments/uln2003aidre4/1912622

I don't know the reason for this error: I can export tables from other websites, but my code fails on this one (and also on Mouser.com).

My theory is that these two websites block my script to prevent their data from being exported, but I'm not sure.

The table I want to export and its inspection

Here is my code:

import time
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd

def get_specifications_table(url):
    options = Options()
    options.add_argument('--headless')  # Run the browser in headless mode (no visible window)

    driver = webdriver.Chrome(options=options)

    driver.get(url)
    time.sleep(5)  # Add a delay to allow the webpage to load (adjust the time as needed)

    try:
        # Find the element with the specified class name "MuiTable-root css-u6unfi" and extract the table
        class_name = "MuiTable-root.css-u6unfi"
        table_element = driver.find_element("css selector", f".{class_name}")
        table_html = table_element.get_attribute('outerHTML')
        df = pd.read_html(table_html)[0]
        return df
    except Exception as e:
        print("Error:", e)
    finally:
        driver.quit()

    return None

def export_to_excel(df, output_file):
    # The context manager saves and closes the file; ExcelWriter.save() was removed in pandas 2.0
    with pd.ExcelWriter(output_file, engine='xlsxwriter') as writer:
        df.to_excel(writer, index=False)

if __name__ == '__main__':
    url = "https://www.digikey.com/en/products/detail/texas-instruments/uln2003aidre4/1912622"
    output_excel_file = "Specifications_Table_Digikey.xlsx"

    print("Fetching the webpage and extracting the table...")
    specifications_df = get_specifications_table(url)
    
    if specifications_df is not None:
        print("Exporting the table to Excel...")
        export_to_excel(specifications_df, output_excel_file)
        print(f"Table 'Specifications' exported to '{output_excel_file}' successfully.")
    else:
        print("Table extraction or export failed.")

But I face this error:

Fetching the webpage and extracting the table...
Error: Message: no such element: Unable to locate element: {"method":"css selector","selector":".MuiTable-root.css-u6unfi"}
  (Session info: headless chrome=115.0.5790.110); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
Backtrace:
    GetHandleVerifier [0x004BA813+48355]
    (No symbol) [0x0044C4B1]
    (No symbol) [0x00355358]
    (No symbol) [0x003809A5]
    (No symbol) [0x00380B3B]
    (No symbol) [0x003AE232]
    (No symbol) [0x0039A784]
    (No symbol) [0x003AC922]
    (No symbol) [0x0039A536]
    (No symbol) [0x003782DC]
    (No symbol) [0x003793DD]
    GetHandleVerifier [0x0071AABD+2539405]
    GetHandleVerifier [0x0075A78F+2800735]
    GetHandleVerifier [0x0075456C+2775612]
    GetHandleVerifier [0x005451E0+616112]
    (No symbol) [0x00455F8C]
    (No symbol) [0x00452328]
    (No symbol) [0x0045240B]
    (No symbol) [0x00444FF7]
    BaseThreadInitThunk [0x772500C9+25]
    RtlGetAppContainerNamedObjectPath [0x77BC7B4E+286]
    RtlGetAppContainerNamedObjectPath [0x77BC7B1E+238]

Table extraction or export failed.

Check the `driver.page_source` variable. – Vishnudev Krishnadas, Jul 30 '23 at 09:22

score 0 · Answer 1 · edited Jul 30 '23 at 09:52

Check driver.page_source to see what the site actually returns to your script:
```
print(driver.page_source)
```
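
Printing the whole page source can be a lot to read. As a rough sketch, the check can be narrowed to two questions: what is the page title, and is the table there at all? The helper below uses only the standard library. `summarize_page` is a hypothetical name, the `product-attributes` id is taken from the `pd.read_html()` call in this answer, and the sample HTML is made up for illustration, not an actual response from the site.

```python
from html.parser import HTMLParser


class _PageProbe(HTMLParser):
    """Collects the <title> text and the id attribute of every <table> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.table_ids = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "table":
            # attrs is a list of (name, value) pairs
            self.table_ids.append(dict(attrs).get("id"))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize_page(page_source, table_id="product-attributes"):
    """Return the page title and whether a table with the given id is present."""
    probe = _PageProbe()
    probe.feed(page_source)
    return {
        "title": probe.title.strip(),
        "table_found": table_id in probe.table_ids,
    }


# Made-up sample; with Selenium you would pass driver.page_source instead.
sample = "<html><head><title>Access Denied</title></head><body></body></html>"
print(summarize_page(sample))
```

If the title looks like an error or challenge page and `table_found` is False, the selector is probably not the problem; the site is serving different content to the script.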

Based on that information, set a user-agent to avoid the block:

options = Options()
options.add_argument('--headless')
options.add_argument('user-agent=whatever you like to set')
driver = webdriver.Chrome(options=options)
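
For context, headless Chrome usually announces itself with a `HeadlessChrome` token in its default user-agent, which is one signal a site can use to reject the request. Below is a minimal sketch of overriding it. The user-agent string is only an illustrative desktop Chrome value, and `chrome_arguments` and `make_driver` are hypothetical helper names. Whether this alone gets past a given site's bot detection is not guaranteed.

```python
# Illustrative value only (an assumption); copy the real string from your own browser.
EXAMPLE_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
)


def chrome_arguments(user_agent=EXAMPLE_USER_AGENT, headless=True):
    """Build the list of command-line switches passed to Chrome."""
    args = [f"user-agent={user_agent}"]
    if headless:
        args.append("--headless")
    return args


def make_driver(user_agent=EXAMPLE_USER_AGENT):
    """Create a Chrome driver with the custom user-agent applied."""
    # Imported here so chrome_arguments() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(user_agent):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


print(chrome_arguments())
```

After creating the driver, `driver.execute_script("return navigator.userAgent")` can be used to confirm which value the browser actually reports.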

Select the table more specifically, in this case directly via pandas.read_html() and the table's id attribute; .iloc[:,:2] keeps only the first two columns (Type and Description):
```
pd.read_html(driver.page_source, attrs={'id':'product-attributes'})[0].iloc[:,:2]
```

Example

...
options = Options()
options.add_argument('--headless')
options.add_argument('user-agent=whatever you like to set')
driver = webdriver.Chrome(options=options)

driver.get('https://www.digikey.com/en/products/detail/texas-instruments/uln2003aidre4/1912622')

pd.read_html(driver.page_source, attrs={'id':'product-attributes'})[0].iloc[:,:2]

Output

	Type	Description
0	Category	Integrated Circuits (ICs)Power Management (PMIC)Power Distribution Switches, Load Drivers
1	Mfr	Texas Instruments
2	Series	ULx200xA
3	Package	Tape & Reel (TR)
4	Product Status	Discontinued at Digi-Key
5	Switch Type	Relay, Solenoid Driver
6	Number of Outputs	7
7	Ratio - Input:Output	1:1
8	Output Configuration	Low Side
...

score 0 · Accepted Answer · answered Jul 30 '23 at 17:43

To scrape the Product Attributes table from the DigiKey page for the ULN2003AIDRE4 (Texas Instruments), use WebDriverWait with visibility_of_element_located() to wait until the <table> element is visible, then load its HTML into a pandas DataFrame. You can use the following locator strategy:

Code Block:

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

options = Options()
options.add_argument('--headless=new')
options.add_argument("start-maximized")
driver = webdriver.Chrome(options=options)
driver.get("https://www.digikey.com/en/products/detail/texas-instruments/uln2003aidre4/1912622")
time.sleep(10)
table_data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[data-evg='product-details-product-attributes'] table.MuiTable-root"))).get_attribute("outerHTML")
df = pd.read_html(table_data)
print(df)
driver.quit()

Console Output:

[                          Type                                        Description  Select
0                     Category  Integrated Circuits (ICs)Power Management (PMI...     NaN
1                          Mfr                                  Texas Instruments     NaN
2                       Series                                           ULx200xA     NaN
3                      Package                                   Tape & Reel (TR)     NaN
4               Product Status                           Discontinued at Digi-Key     NaN
5                  Switch Type                             Relay, Solenoid Driver     NaN
6            Number of Outputs                                                  7     NaN
7         Ratio - Input:Output                                                1:1     NaN
8         Output Configuration                                           Low Side     NaN
9                  Output Type                                         Darlington     NaN
10                   Interface                                           Parallel     NaN
11              Voltage - Load                                          50V (Max)     NaN
12  Voltage - Supply (Vcc/Vdd)                                       Not Required     NaN
13      Current - Output (Max)                                              500mA     NaN
14                Rds On (Typ)                                                  -     NaN
15                  Input Type                                          Inverting     NaN
16                    Features                                                  -     NaN
17            Fault Protection                                                  -     NaN
18       Operating Temperature                                 -40°C ~ 105°C (TA)     NaN
19               Mounting Type                                      Surface Mount     NaN
20     Supplier Device Package                                            16-SOIC     NaN
21              Package / Case                     16-SOIC (0.154", 3.90mm Width)     NaN
22         Base Product Number                                            ULN2003     NaN]
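
Note that `pd.read_html()` returns a list of DataFrames, which is why the output above is wrapped in `[...]`, and that the `Select` column is entirely `NaN`. A possible follow-up step, sketched here with a hypothetical `tidy_attributes` helper and a small hand-built DataFrame standing in for the scraped one, is to take the first table, drop the empty column, and write it out as CSV:

```python
import pandas as pd


def tidy_attributes(tables):
    """Take the first DataFrame from read_html() and drop all-NaN columns."""
    df = tables[0]
    return df.dropna(axis="columns", how="all")


# Stand-in for the list returned by pd.read_html(table_data)
tables = [
    pd.DataFrame(
        {
            "Type": ["Mfr", "Series"],
            "Description": ["Texas Instruments", "ULx200xA"],
            "Select": [float("nan"), float("nan")],
        }
    )
]

clean = tidy_attributes(tables)
# With no path, to_csv() returns the CSV text; pass a file name to save it instead.
print(clean.to_csv(index=False))
```

The same `clean` DataFrame can be passed to `to_excel()` if an Excel file is preferred.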


I had tried a lot with Selenium and the requests library, but whatever I tried, I ran into this: `Backtrace: GetHandleVerifier [0x00A4A813+48355] (No symbol) [0x009DC4B1] (No symbol) [0x008E5358] (No symbol) [0x009109A5] ....` But your script works, thanks dude! – Arman Zamani, Aug 02 '23 at 12:16
