I've got a bit of a head-scratcher while working on a table with Python 3 and Selenium. I am trying to extract data from a few columns of a table (tblGuid).

While the data is presumably retrieved correctly (len(rows) prints the expected number of rows), the iterator seems to get stuck on the first element: it prints the same socket repeatedly, with the number of prints matching len(rows).

vlan = "vlan14"

time.sleep(3)
# Enter filter for vlan
print("Filtered by vlan: " + vlan)
browser.find_element_by_xpath("/html/body/div[1]/div[4]/div[3]/div[4]/div/div[2]/div/div[1]/div[3]/div/table/tfoot/tr/th[13]/input").send_keys(vlan)

# Sort by socket
browser.find_element_by_xpath("/html/body/div[1]/div[4]/div[3]/div[4]/div/div[2]/div/div[1]/div[1]/div/table/thead/tr/th[14]").click()

time.sleep(2)
table = browser.find_element_by_id('tblGuid')
rows = table.find_elements_by_xpath(".//tr")

time.sleep(2)

print("Len: ", len(rows))

for row in rows:
    socket = row.find_element_by_xpath('//td[10]').text
    print("Socket: ", socket)
    # Other lines of the same nature as the two above go here: get a different column and assign it to a variable.

browser.quit()

I am running this code with firefox and not turning on headless mode, to confirm that all clicks, sorts, and filters are applied as intended. The browser output looks as expected, and the data is all there, with socket being a number that varies between 1 and 52 over ~50 rows. It seems to me that the for loop gets stuck on the first element of rows.

I have added a lot of (probably redundant) time.sleep() calls to ensure that the page is loaded properly, and so that I can see the page being updated as the script progresses.

It is probably worth mentioning that the page I am scraping does not contain the table data in its HTML source, as the table is populated by JavaScript working on a database. At first I thought this was the problem, but the value printed as socket matches the first line of the table (as do any other columns), which tells me that the data is being retrieved correctly and that I am failing to iterate over it.

EDIT - A cleaned up version of the HTML

<table id="tblGuid" class="table table-striped table-hover table-condensed detailedTable table-bordered dataTable" style="width: 99.9%;" role="grid" aria-describedby="tblGuid_info">                    
    <tbody>
        <tr role="row" class="odd">
            <td><button class="tableButton regguid" data-guid="0046ca">Reg.</button></td>
            <td>0046ca</td>
            <td>0110F17754</td>
            <td>A18122</td>
            <td><a href="detail?serial=37530" target="_blank">37530</a></td>
            <td>05929a</td>
            <td>3.0.0</td>
            <td>19-12-21 19:56</td>
            <td>20-01-19 19:53</td>
            <td>20-01-19 19:53</td>
            <td>20526661632</td>
            <td>1</td>
            <td>vlan14</td>
            <td class="sorting_1">1</td>
            <td>0</td>
            <td><a data-node-error="0" data-node-guid="0046ca" href="#">            0</a></td>
            <td><a href="qc?rclId=1279" target="_blank">145811</a></td>
            <td>5554</td>
            <td>152263</td>
            <td>Done</td>
        </tr>
        <tr role="row" class="even">
            <td><button class="tableButton regguid" data-guid="004aa4">Reg.</button></td>
            <td>004aa4</td>
            <td>0110F17D8D</td>
            <td>A19108</td>
            <td><a href="detail?serial=37740" target="_blank">37740</a></td>
            <td>05936c</td>
            <td>3.0.0</td>
            <td>19-12-21 20:15</td>
            <td>20-01-19 19:54</td>
            <td>20-01-19 19:54</td>
            <td>20517699584</td>
            <td>1</td>
            <td>vlan14</td>
            <td class="sorting_1">2</td>
            <td>0</td>
            <td><a data-node-error="0" data-node-guid="004aa4" href="#">            0</a></td>
            <td><a href="qc?rclId=1277" target="_blank">147011</a></td>
            <td>5548</td>
            <td>152311</td>
            <td>Done</td>
        </tr>
    </tbody>
</table>

Notes on the above HTML:

  • Around 40 table rows removed for readability.
  • Table header and footer have been removed.
  • Some data in the cells has been altered for the purpose of this post. The structure remains the same.
  • This is how it appears under "inspect element" in Firefox.
  • The xpath referenced in the python code is based on "copy -> xpath" under inspect element.
Jarmund

2 Answers

2

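Without the table HTML, this is my best guess: the XPath is not doing what you expect. An expression starting with // is evaluated from the document root even when it is called on a row, so //td[10] matches the same first cell on every iteration. Prefix it with a dot to search within the row: find_element_by_xpath('.//td[10]').text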

for row in rows:
    columns = row.find_elements_by_xpath('.//td')
    # Print every cell with its index to see which column you want
    for index, column in enumerate(columns):
        print("column::{}:".format(index), column.text)
    #testsocket = columns[9].text
    socket = row.find_element_by_xpath('.//td[10]').text
    print("Socket: ", socket)
    #print("TestSocket: ", testsocket)
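To see why the leading dot matters without a browser, here is a minimal sketch using the standard library's xml.etree.ElementTree instead of Selenium. ElementTree supports only a subset of XPath and rejects absolute // paths on elements, so the "search from the root" case is emulated by calling find on the table itself. The two-row, three-column table is a made-up stand-in for the one in the question.

```python
import xml.etree.ElementTree as ET

# Cut-down, hypothetical stand-in for the table in the question
# (2 rows, 3 columns instead of ~50 rows and 20 columns).
TABLE = """
<table id="tblGuid">
  <tbody>
    <tr><td>0046ca</td><td>vlan14</td><td>1</td></tr>
    <tr><td>004aa4</td><td>vlan14</td><td>2</td></tr>
  </tbody>
</table>
"""

table = ET.fromstring(TABLE)
rows = table.findall(".//tr")

# Searching from the table root on every iteration always yields
# the first match in the whole table, whatever row the loop is on.
from_root = [table.find(".//td[3]").text for _ in rows]

# Searching from each row yields the third cell of that row.
from_row = [row.find(".//td[3]").text for row in rows]

print(from_root)  # ['1', '1']
print(from_row)   # ['1', '2']
```

The first list repeats the first row's value once per row, which is the symptom described in the question; the second list changes with each row.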
undetected Selenium
Jortega
  • I tried changing the xpath as per your answer, as xpath is something I haven't completely gotten a grasp on yet (I based the xpath in my code on whatever firefox told me). However, the result was that it couldn't find any data at all. I've updated the question with the HTML you requested. – Jarmund Jan 21 '20 at 18:06
  • @Jarmund in the for loop does adding `print("row: ", row.text)` return the expected text? – Jortega Jan 21 '20 at 18:29
  • @Jortega Actually it does. `row.text` just listed all rows as they appear on the page, so as you suspected, there must be something wrong with my `row.find_element_by_xpath()`. The string I'm after resides in the 10th column of each row. – Jarmund Jan 21 '20 at 18:42
  • @Jarmund See the update I was wrong about `..//` it should be `.//`. See if the new variable I added `testsocket` prints what you are looking for on your end. – Jortega Jan 21 '20 at 19:07
  • The script fails at `testsocket =` due to an index out of range error when using `.//td`. I noticed that if I instead use `//td`, `len(columns)` returns an abnormally high number, as if all cells in the entire table are contained in the single row. – Jarmund Jan 21 '20 at 19:43
  • @Jarmund I added for loop to print the text in all the columns of the row so you can see what index you want. – Jortega Jan 21 '20 at 20:11
  • There we go! Your code/suggestion worked once I added a continue statement for rows where the number of columns was less than expected. Turns out that the JavaScript is adding a few hidden rows for a split second. – Jarmund Jan 21 '20 at 20:20
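The fix described in the last comment might look roughly like the sketch below. The function name extract_sockets and the constants are hypothetical; EXPECTED_COLUMNS = 20 is an assumption taken from the 20 cells per row in the sample HTML, and the call is the Selenium 3 style find_elements_by_xpath used in this thread (newer Selenium releases spell it find_elements(By.XPATH, ...)).

```python
# Assumption: each real data row has 20 cells, as in the sample HTML.
EXPECTED_COLUMNS = 20
# 0-based index of the 10th column, which holds the socket.
SOCKET_INDEX = 9


def extract_sockets(rows):
    """Return the socket text of every complete row.

    `rows` is any iterable of objects exposing Selenium 3's
    find_elements_by_xpath(), e.g. the result of
    table.find_elements_by_xpath(".//tr").
    """
    sockets = []
    for row in rows:
        # The leading dot keeps the search inside this row.
        cells = row.find_elements_by_xpath('.//td')
        # Skip header/footer rows and short-lived placeholder rows
        # that do not have the full set of cells.
        if len(cells) < EXPECTED_COLUMNS:
            continue
        sockets.append(cells[SOCKET_INDEX].text)
    return sockets
```

In the question's script this would be called as extract_sockets(rows) after the filter and sort have been applied.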
1

Use WebDriverWait to wait for the required element conditions. You can look up best practices for locators here and here. I suggest changing the locators for the vlan input and for the click.

In row.find_element_by_xpath('.//td[10]').text, you need to put a leading . (dot) in the XPath to search among the row's children; you can also use attributes, if any exist.

Selenium only returns the text of visible elements, which is why the code waits for all rows to be visible.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


options = webdriver.ChromeOptions()
options.headless = True

driver = webdriver.Chrome(options=options)
wait = WebDriverWait(driver, 10)

vlan = "vlan14"
with driver:
    driver.get("url")

    # Enter filter for vlan
    wait.until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[1]/div[4]/div[3]/div[4]/div/div[2]/div/div[1]/div[3]/div/table/tfoot/tr/th[13]/input"))).send_keys(vlan)

    # Sort by socket
    driver.find_element_by_xpath(
        "/html/body/div[1]/div[4]/div[3]/div[4]/div/div[2]/div/div[1]/div[1]/div/table/thead/tr/th[10]").click()

    rows = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '#tblGuid tr[role=row]')))
    for row in rows:
        table_data = row.find_elements_by_css_selector('td')
        socket = table_data[9].text
        print("Socket: ", socket)
Sers
  • I do not know how many rows the table will eventually contain, other than it should be at least 1 (3 including header and footer). Any suggestion on how to work around this? – Jarmund Jan 21 '20 at 17:50
  • Add html with table headers also – Sers Jan 21 '20 at 18:13
  • If I use `.//td[10]` instead of `//td[10]`, the script fails as it cannot find the cell – Jarmund Jan 21 '20 at 19:51
  • Because I cannot check, use the updated code from the answer. If you use `//td[10]` it will search all `td` on the page and not just those under the row, which means you'll always get the same `td` – Sers Jan 21 '20 at 19:59
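For the case raised in the comments, where the final row count is unknown and the JavaScript is still adding or removing rows, one possible workaround is to poll until the count stops changing. The helper below is a hypothetical sketch in plain Python, not part of Selenium; it treats two equal, non-zero consecutive readings as "settled", which is an assumption that may need tuning for a slow page.

```python
import time


def wait_for_stable_count(get_count, interval=0.5, timeout=10.0):
    """Poll get_count() until two consecutive readings match.

    Returns the settled, non-zero count, or raises TimeoutError if
    the count is still changing (or zero) when the timeout expires.
    """
    deadline = time.monotonic() + timeout
    previous = get_count()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = get_count()
        if current > 0 and current == previous:
            return current
        previous = current
    raise TimeoutError(
        "row count did not settle within {} seconds".format(timeout)
    )
```

With the driver from the answer, get_count could be something like lambda: len(driver.find_elements_by_css_selector('#tblGuid tr[role=row]')), after which the rows can be fetched once and iterated.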