This question is related to my previous two: Inducing WebDriverWait for specific elements and Inconsistency in scraping through <div>'s in Selenium.

I am scraping all of the Air Jordan sneakers from https://www.grailed.com/. The feed is an infinitely scrolling list of sneakers, and I am using Selenium WebDriver to scrape the data. My problem is that the images for the shoes seem to take a while to load, which throws a lot of errors. I have found a pattern in the XPaths of the images. The XPath to the first image is /html/body/div[3]/div[6]/div[3]/div[3]/div[2]/div[2]/div[1]/a/div[2]/img, the second is /html/body/div[3]/div[6]/div[3]/div[3]/div[2]/div[2]/div[2]/a/div[2]/img, and so on. They follow a linear sequence in which the second-to-last div index increases by one each time. To handle this, I put the following in my loop (only the relevant code is included).

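    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    i = 1
    # Page height before the first scroll, used to detect the end of the feed
    last_height = driver.execute_script("return document.body.scrollHeight")
    while len(sneakers) < sneaker_count:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Get sneakers currently on page and add to sneakers list
        feed = driver.find_elements(By.CLASS_NAME, 'feed-item')
        for item in feed:
            # Absolute XPath of the i-th image; only the second-to-last div index changes
            xpath = "/html/body/div[3]/div[6]/div[3]/div[3]/div[2]/div[2]/div[" + str(i) + "]/a/div[2]/img"
            img = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, xpath)))
            i += 1
        # Stop once scrolling no longer increases the page height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height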

The issue is that after about the 5th pair of shoes, the wait statement times out. It seems that the XPath passed in after that pair of shoes is not recognized. I used Firefox Developer to check the XPath using the "Copy XPath" feature, and it looks identical to the XPath I pass in when I print it. I use ChromeDriver with Selenium, but I don't think that's relevant. Does anyone know why the XPaths stop being recognized even though they seem identical?

UPDATE: Using an XPath checker add-on for Chrome, the XPaths for items 1-4 are detected, but detection often stops after item 6. When I inspect the XPath (in both Chrome and Firefox Developer), it still looks identical, yet the "CSS and XPath checker" add-on does not find the element. This is a huge mystery to me.

Eric Hasegawa
  • Two questions: 1) Why would you construct your locators based on indexes? Since the elements on the page are dynamic, you may soon hit a _Timeout_ exception. 2) Why would you wait **20** seconds per element for 40+ elements, which totals 20 × 40 = 800 seconds, i.e. more than 10 minutes? Instead, you should wait only once for a specified time, say 10 or 20 seconds, for all your desired elements to appear within the HTML. – undetected Selenium Jun 25 '20 at 18:18
  • @DebanjanB I tried using a single wait for all of them but it didn't work. The specified images still didn't load. And in regard to 1), I have tried many other ways; this seems to be the best option. – Eric Hasegawa Jun 25 '20 at 18:52
  • Can you rephrase the question a bit to say which specific aspect you are looking at? Leaving XPath and indexes aside, what does your requirement say? – undetected Selenium Jun 25 '20 at 20:08
  • @DebanjanB I'm not sure what you mean by leaving XPath aside, since that's essentially what I'm asking. But what I'm trying to do is just scrape each image individually within the loop. I don't know how to do it outside of the loop. – Eric Hasegawa Jun 25 '20 at 20:38
  • Constructing an effective _xpath_ is half the work, and how you want to traverse/collect is the other half. There are any number of ways to construct the _xpath_ without the index. The main question is: how do you want to collect the items? – undetected Selenium Jun 25 '20 at 20:43
  • I want to collect them however I can; I was just thinking the XPath was the most efficient way of doing so. Once I have the element (i.e. img is defined), everything is good from there. It is just the traversal to get to the img URL that I'm struggling with. – Eric Hasegawa Jun 25 '20 at 20:53
  • Okay, let me ask you the other way round: what issue are you facing with the solutions you received to your first and second questions? – undetected Selenium Jun 25 '20 at 20:55
  • The first pointed me to WebDriverWait, which I implemented. The second helped me trace the elements with the XPath, which I think I have done, but I have some sort of error here. So now I am trying to fix the XPath error. – Eric Hasegawa Jun 25 '20 at 21:03
  • Now, just tell me about the error you are facing, and at which step/line (nothing else) :) – undetected Selenium Jun 25 '20 at 21:08
  • @DebanjanB In the for loop, the statement `img = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, xpath)))` times out after the first few elements. This is likely because it can't find the element with the specified XPath, even though the XPath is exact when I look at the website manually. – Eric Hasegawa Jun 25 '20 at 21:10
  • Hmmm, now I think I understand your problem. You want to scroll and keep on collecting the elements, right? – undetected Selenium Jun 25 '20 at 21:11
  • @DebanjanB Correct – Eric Hasegawa Jun 25 '20 at 21:14
  • I remember the website is full of JS. I will have a fresh look first thing tomorrow morning. – undetected Selenium Jun 25 '20 at 21:15
  • @DebanjanB That would be amazing! – Eric Hasegawa Jun 25 '20 at 21:16
  • @DebanjanB any updates? – Eric Hasegawa Jun 26 '20 at 16:24

1 Answer

I found the problem. The XPath was fine, but after the first 4-5 elements the images are lazy-loaded. It's not that they take too long to load; the HTML simply contains placeholders for them. This means a different approach is needed to scrape these images.
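One possible way to deal with this, as a rough sketch rather than a tested fix for Grailed: scroll each feed item into view so the page is prompted to load its image, then poll until the `<img>` carries a real URL instead of a placeholder. The helper names `is_real_image_url` and `wait_for_image_url` are hypothetical, and the sketch assumes that each `feed-item` element contains its `<img>`, that placeholders are inline `data:` URIs, and that the real URL ends up in `src` (or sits in a `data-src` attribute). None of this has been checked against the site's actual markup.

```python
import time

# Assumption: placeholder images are inline data URIs (not verified on Grailed).
PLACEHOLDER_PREFIXES = ("data:",)


def is_real_image_url(url):
    """Return True if `url` looks like a loaded image rather than a placeholder."""
    return bool(url) and not url.startswith(PLACEHOLDER_PREFIXES)


def wait_for_image_url(driver, item, timeout=10.0, poll=0.25):
    """Scroll one feed item into view and poll until its <img> has a real URL.

    `driver` is a Selenium WebDriver and `item` is the WebElement of one feed
    item. Returns the URL, or None if nothing loaded before the timeout.
    """
    # Bring the item into the viewport so the lazy loader can trigger.
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", item)
    deadline = time.monotonic() + timeout
    while True:
        # Look the image up relative to the item instead of by absolute index.
        # "tag name" is the string value of selenium's By.TAG_NAME.
        for img in item.find_elements("tag name", "img"):
            # Assumption: the real URL is in `src` or, failing that, `data-src`.
            for attr in ("src", "data-src"):
                url = img.get_attribute(attr)
                if is_real_image_url(url):
                    return url
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

Inside the scraping loop this could be called as `url = wait_for_image_url(driver, item)` for each `item` in `feed`, skipping items for which it returns `None`. The same polling could also be expressed as a custom condition passed to `WebDriverWait(...).until(...)`; a plain loop is used here only to keep the sketch free of extra imports.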

Eric Hasegawa