Web scraping table with missing attributes via Python Selenium and Pandas

Question

I am scraping a table from a website, but some cells are empty. The try-except block below messes up the rows of data at the end. I also don't want to drop the whole row, because the information is still relevant even when some attribute is missing.

try:
    for i in range(10):
        data = {'ID': IDs[i].get_attribute('textContent'),
                'holder': holder[i].get_attribute('textContent'),
                'view': view[i].get_attribute('textContent'),
                'material': material[i].get_attribute('textContent'),
                'Addons': addOns[i].get_attribute('textContent'),
                'link': link[i].get_attribute('href')}
        list.append(data)
except:
    print('Error')

Any ideas?

Without the try-except block, the error is IndexError: list index out of range. – Lilly Aug 21 '22 at 19:40
It's hard to tell exactly what you are asking. What result are you trying to obtain? The try-except block cannot be "screwing up the data." – AlexK Aug 21 '22 at 21:05
It's not the data but the rows of data that are being screwed up. – Lilly Aug 22 '22 at 15:07

Scoobylolo · Accepted Answer · 2022-08-24T17:20:06.227

You can put all the element lists whose attributes you want to read into a dictionary, like this:

objects={"IDs":IDs,"holder":holder,"view":view,"material":material...}

Then you can iterate through this dictionary, and if a specific element does not exist, simply store an empty string as the value for that dict key. Something like this:

the_keys = list(objects.keys())
# I assume the ID field will never be empty, so looping over its length
# iterates only through existing rows.
for i in range(len(objects["IDs"])):
    data = {}
    for j in range(len(the_keys)):
        try:
            data[the_keys[j]] = objects[the_keys[j]][i].get_attribute('textContent')
        except Exception as e:
            # It is better to catch only the specific exception thrown when the
            # element does not exist (IndexError, per the asker's comment).
            print("Exception: {}".format(e))
            data[the_keys[j]] = ""  # this means we had an exception
    list.append(data)  # `list` is the results list from the question

I haven't tested this code, so I don't know if it works as is, but it should give you an overall idea of how to solve your problem.

If you have any questions, doubts, or concerns please ask away.

Edit: To get another object's attribute like the href you can simply include an if statement checking the value of the key. I also realized you can just loop through the objects dictionary getting the keys and values instead of accessing each key and value by an index. You could change the inner loop to be like this:

for key, value in objects.items():
    try:
        if key == "link":
            data[key] = value[i].get_attribute("href")
        else:
            data[key] = value[i].get_attribute("textContent")
    except Exception as e:
        print("Error: ", e)
        data[key] = ""
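To try the per-field error handling without a browser, here is a minimal sketch. It assumes a stand-in class, `FakeElement`, in place of real Selenium elements, and a hypothetical helper name, `build_rows`; neither comes from the question. It catches only `IndexError`, which is the exception the asker reports when a column list is shorter than the others.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (get_attribute only)."""

    def __init__(self, **attrs):
        self._attrs = attrs

    def get_attribute(self, name):
        # Like Selenium, return None when the attribute is absent.
        return self._attrs.get(name)


def build_rows(objects, n_rows):
    """Build one dict per row; a missing element becomes an empty string."""
    rows = []
    for i in range(n_rows):
        row = {}
        for key, elements in objects.items():
            attribute = "href" if key == "link" else "textContent"
            try:
                row[key] = elements[i].get_attribute(attribute)
            except IndexError:
                # This column has fewer elements than there are rows.
                row[key] = ""
        rows.append(row)
    return rows


# Made-up sample data: the "holder" column is one element short.
objects = {
    "IDs": [FakeElement(textContent="1"), FakeElement(textContent="2")],
    "holder": [FakeElement(textContent="Ann")],
    "link": [FakeElement(href="https://example.com/1"),
             FakeElement(href="https://example.com/2")],
}
rows = build_rows(objects, n_rows=len(objects["IDs"]))
```

Note that the empty string always lands in the last rows of the short column, whichever row actually lacked the cell, so this keeps the loop running but does not by itself guarantee that values line up with the right row.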

Edit 2:

# Build one list per column instead of one dict per row.
data = {key: [] for key in objects}
for key, value in objects.items():
    for i in range(len(objects["IDs"])):
        try:
            if key == "link":
                data[key].append(value[i].get_attribute("href"))
            else:
                data[key].append(value[i].get_attribute("textContent"))
        except Exception as e:
            print("Error: ", e)
            data[key].append("")

Try this. You won't have to append the data dictionary to the list. Without the original data I can't help much more, but I believe this should work.
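One caveat with separate per-column lists: if a column has fewer elements than there are rows, the padding goes at the end of that column, not at the row that actually lacked the cell. A possible alternative, sketched here under assumptions, is to look up each cell relative to its own row. `FakeCell`, `FakeRow`, `COLUMNS`, and `scrape_row` are hypothetical names, and apart from `td[12]/a`, which appears in the asker's XPath, the column positions are made up.

```python
class FakeCell:
    """Hypothetical stand-in for a Selenium WebElement holding one cell."""

    def __init__(self, **attrs):
        self._attrs = attrs

    def get_attribute(self, name):
        return self._attrs.get(name)


class FakeRow:
    """Hypothetical stand-in for a <tr> element: relative XPath -> cell."""

    def __init__(self, cells):
        self._cells = cells

    def find_elements(self, by, value):
        # Like Selenium's find_elements, return [] instead of raising.
        cell = self._cells.get(value)
        return [cell] if cell is not None else []


# Assumed layout: field name -> (XPath relative to the row, attribute to read).
COLUMNS = {
    "ID": ("./td[1]", "textContent"),
    "holder": ("./td[2]", "textContent"),
    "link": ("./td[12]/a", "href"),
}


def scrape_row(row):
    """Read every field from one row; a missing cell becomes ""."""
    record = {}
    for field, (xpath, attribute) in COLUMNS.items():
        # "xpath" is the string value of Selenium's By.XPATH.
        matches = row.find_elements("xpath", xpath)
        value = matches[0].get_attribute(attribute) if matches else None
        record[field] = value or ""
    return record


# Made-up sample data: the first row has no link element.
table_rows = [
    FakeRow({"./td[1]": FakeCell(textContent="1"),
             "./td[2]": FakeCell(textContent="Ann")}),
    FakeRow({"./td[1]": FakeCell(textContent="2"),
             "./td[2]": FakeCell(textContent="Bob"),
             "./td[12]/a": FakeCell(href="https://example.com/2")}),
]
records = [scrape_row(row) for row in table_rows]
```

With a real driver, the rows would presumably come from something like `driver.find_elements(By.XPATH, '//*[@id="table_1"]/tbody/tr')`, and the resulting list of dicts can be passed to `pandas.DataFrame(records)`. Because each lookup is scoped to one row, the blank stays in the row where the cell was missing.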

edited Aug 24 '22 at 17:20

answered Aug 21 '22 at 21:28


With some tweaking, the above code is working. What about the link object? It's not textContent but an href (.get_attribute('href')). – Lilly Aug 22 '22 at 15:07
Check the edit above. Now you should be good to get the link object's "href" attribute. – Scoobylolo Aug 23 '22 at 16:17
Thanks, Scoobylolo. But going back to the initial problem: the empty string is not stored under the corresponding dict key for the right row. This is because some of the tables have XPaths with 10 rows and some with 9 rows (//*[@id="table_1"]/tbody/tr[9]/td[12]/a), which causes the IndexError. With your solution (data[the_keys[j]]=""), the empty string ends up in the wrong row. How can this be solved via error handling (maybe with RegEx)? FYI, the tables are dynamic. – Lilly Aug 23 '22 at 17:22
I believe I found the problem in my code. I am first looping through the lists inside all the selenium objects and then through the dictionary containing these lists. It should be the other way around. The outer loop should be the inner loop and vice-versa. Then, change where you use the i and j correspondingly if it needs to be changed. If you have trouble switching the loops tell me and I will edit the code in my answer. – Scoobylolo Aug 23 '22 at 21:35
I added some other code which might be able to fix the problem but I am not sure. Thanks. – Scoobylolo Aug 24 '22 at 17:21
After some tweaking, the code is working very well and is scraping tables from the website. But the following error occurs: selenium.common.exceptions.NoSuchWindowException: Message: no such window: window was already closed (Session info: chrome=104.0.5112.102). Any idea how this can be solved? Sometimes it happens at table 800, 900 or 1900. The error occurs at wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="table_1_next"]'))).click() – Lilly Aug 25 '22 at 13:49
Maybe something like this would work (I have not tested it yet):
    page += 1
    print("Number of pages " + str(page))
    try:
        wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="table_1_next"]'))).click()
    except:
        wait.until(EC.element_to_be_clickable((By.XPATH, f'//*[@id="table_1_paginate"]/span/a[text()="{page}"]'))).click()
– Lilly Aug 25 '22 at 16:08
Hmm, I am not sure why you get this error because I cannot see the code. Maybe you closed the window you are trying to scrape, or the page contains an iframe that you need to switch to. Maybe this thread can help you: https://stackoverflow.com/questions/63094654/selenium-common-exceptions-nosuchwindowexception-message-no-such-window-error Since I answered the question, can you please up-vote and accept it? If you have any other doubts, I can help either by answering your question in another thread or via direct messages, although I think Stack Overflow doesn't have those. – Scoobylolo Aug 25 '22 at 20:07
I will check the link. Again, thanks for the help! I accepted your answer, but unfortunately I don't have enough reputation points to up-vote :( – Lilly Aug 26 '22 at 12:06
