0

How do I find elements by class name without repeating the output? I have two classes to scrape, hdrlnk and result-price. I wrote the code like this:

x = driver.find_elements_by_class_name(['hdrlnk','result-price'])

and it gives me an error. Here is another version that I tried:

x = driver.find_elements_by_class_name('hdrlnk')
y = driver.find_elements_by_class_name('result-price')
for xs in x:
    for ys in y:
        print(xs.text + ys.text)   

But I got a result like this:

sony 5 disc cd changer$40
sony 5 disc cd changer$70
sony 5 disc cd changer$70
sony 5 disc cd changer$190
sony 5 disc cd changer$190
sony 5 disc cd changer$190
sony 5 disc cd changer$190
sony 5 disc cd changer$10

This is the part of the HTML structure that I am trying to scrape:

<p class="result-info">
    <span class="icon icon-star" role="button" title="save this post in your favorites list">
        <span class="screen-reader-text">favorite this post</span>
    </span>
    <time class="result-date" datetime="2019-11-07 18:20" title="Thu 07 Nov 06:20:56 PM">Nov  7</time>
    <a href="https://vancouver.craigslist.org/rch/ele/d/chandeliers/7015824686.html" data-id="7015824686" class="result-title hdrlnk">CHANDELIERS</a>
    <span class="result-meta">
        <span class="result-price">$800</span>
        <span class="result-hood"> (Richmond)</span>
        <span class="result-tags">
            <span class="pictag">pic</span>
        </span>
        <span class="banish icon icon-trash" role="button">
            <span class="screen-reader-text">hide this posting</span>
        </span>
        <span class="unbanish icon icon-trash red" role="button" aria-hidden="true"></span>
        <a href="#" class="restore-link">
            <span class="restore-narrow-text">restore</span>
            <span class="restore-wide-text">restore this posting</span>
        </a>
    </span>
</p>

The first element is repeated but I got the correct value for the second one. How do I correct this error?

JeffC
draw134

4 Answers

5

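.find_elements_by_class_name() only takes a single class name. What I would suggest is using a CSS selector to do this job. Note that in the posted HTML, .result-price is not nested inside .hdrlnk (both sit under p.result-info), so the descendant selector .hdrlnk .result-price matches nothing there; .result-info .result-price finds the prices. The code would look like

prices = driver.find_elements_by_css_selector('.result-info .result-price')

This collects all the price elements. If you also want the labels, you will have to write a little more code.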

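for heading in driver.find_elements_by_css_selector('.hdrlnk'):
    print(heading.text)
    # following-sibling keeps the search inside this listing; following:: would match every later price on the page
    for price in heading.find_elements_by_xpath('./following-sibling::span[@class="result-meta"]/span[@class="result-price"]'):
        print('  ' + price.text)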

See the docs for all the options to find elements.

CSS selector references:
W3C reference
Selenium Tips: CSS Selectors
Taming Advanced CSS Selectors

Code-Apprentice
JeffC
  • I second this approach. – Manmohan_singh Nov 08 '19 at 05:09
  • `x` here is a little challenging to iterate over since you need to take two elements at a time. I'm not saying it can't be done, but it requires some tricks from the itertools recipes. – Code-Apprentice Nov 08 '19 at 05:18
  • @Code-Apprentice Yeah I posted the answer and then went back and reread the question and realized from what OP was printing he might want something other than what I thought based on his original locators. I've since updated my answer to also include this second approach. – JeffC Nov 08 '19 at 05:21
3

I don't think you need a nested loop. Iterate by index over the length of the list instead, using len():

x = driver.find_elements_by_class_name('hdrlnk')
#y = driver.find_elements_by_class_name('result-price')
y = driver.find_elements_by_xpath('//p[@class="result-info"]/span[@class="result-meta"]//span[@class="result-price"]')

# Both lengths should match; if they differ, the pairs below will be misaligned.
print(len(x))
print(len(y))

for i in range(len(x)):
    print(x[i].text + y[i].text)

UPDATE

I assume you want to pair each member of x with the corresponding member of y, like this:

x[0] with y[0]
x[1] with y[1]
etc.

This only works if x and y have the same number of elements. Given that, one index is enough to drive the loop, so I used len(x) (len(y) would work just as well).

If you want both lists in the loop statement itself, you can use zip. See the other answers in this thread.
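
To make the difference between the nested loop, the index loop, and zip concrete, here is a minimal sketch that uses plain strings in place of the .text of Selenium WebElements. The sample titles and prices are made up for illustration; no browser is involved.

```python
# Plain strings stand in for WebElement.text values (made-up sample data).
titles = ["sony 5 disc cd changer", "CHANDELIERS", "ipad (6th gen)"]
prices = ["$40", "$800", "$425"]

# Nested loops visit every (title, price) combination: 3 x 3 = 9 lines.
nested = [t + " " + p for t in titles for p in prices]

# zip() walks both lists in step: 3 lines, one per listing.
paired = [t + " " + p for t, p in zip(titles, prices)]

# Index-based iteration gives the same pairs as zip() when the lengths match.
indexed = [titles[i] + " " + prices[i] for i in range(len(titles))]

for line in paired:
    print(line)
```

The nested version is what produces the repeated titles in the question: each title is combined with every price rather than only its own.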

For XPath, see: Locator Strategies

Copying the XPath from Inspect Element gives you an absolute path. I don't recommend it, because it breaks easily when the page structure changes.

See this thread: Absolute vs Relative Xpath
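
As a rough illustration of why a position-based path is fragile, the sketch below uses Python's built-in xml.etree.ElementTree as a stand-in for the browser. ElementTree supports only a subset of XPath, and the markup, the wrapper div, and the helper name first_text are all hypothetical; the point is only that a relative, attribute-based path can survive a layout change that breaks a position-based one.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified listing row modelled on the HTML in the question.
ROW = (
    "<li><p class='result-info'>"
    "<span class='icon'/>"
    "<time>Nov 7</time>"
    "<a class='result-title hdrlnk'>CHANDELIERS</a>"
    "<span class='result-meta'><span class='result-price'>$800</span></span>"
    "</p></li>"
)

# The same row before and after an (assumed) extra wrapper <div> is added.
BEFORE = "<div id='sortable-results'><ul>" + ROW + "</ul></div>"
AFTER = "<div id='sortable-results'><div class='wrapper'><ul>" + ROW + "</ul></div></div>"

# Position-based path, similar in spirit to one copied from the inspector.
ABSOLUTE = "./ul/li[1]/p/span[2]/span[1]"
# Relative path that depends only on the class of the target element.
RELATIVE = ".//span[@class='result-price']"


def first_text(markup, path):
    """Return the text of the first match for path, or None if nothing matches."""
    node = ET.fromstring(markup).find(path)
    return None if node is None else node.text


print(first_text(BEFORE, ABSOLUTE), first_text(AFTER, ABSOLUTE))
print(first_text(BEFORE, RELATIVE), first_text(AFTER, RELATIVE))
```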

frianH
  • It works for me sir. Thanks a lot. Could you explain to me deeper why you only use `x` in the len loop. And why does the answers below doesn't work for me? – draw134 Nov 08 '19 at 06:13
  • Ohw sir it has some drawbacks in it. It kinda works but the price is repeated twice befor it is updated. Its like this `ipad (6th gen) 32gb wifi+cellular BNIB , $425 drag 2 platinum with smok tfv12 tank , $425 Brand New Lightning Cable , $100` drag 2 platinum price was getting the price of the ipad which is $425 instead of $100. – draw134 Nov 08 '19 at 06:17
  • @Vince With the above code, I assume you have same number of `x` and `y`. So if you facing repeated twice `price`, it possible you have `y` more than `x`. I've updated the code, please try again and how did it go, I've changed the locator by `xpath` and for make it will print the length of both lists first. – frianH Nov 08 '19 at 10:09
  • Now this works for me. But I have a question, why did you used `xpath` instead of classname? And in your loop statement why you only used `x` variable and not `y` Im a little bit confused of it – draw134 Nov 11 '19 at 01:26
  • how did you come up with that xpath sir? I tried copying the xpath from inspect element and it get me an xpath like this `//*[@id="sortable-results"]/ul/li[1]/p/span[2]/span[1]` – draw134 Nov 11 '19 at 01:34
  • @Vince I've added a bit explanation in the answer, please see. Hope this helps. – frianH Nov 15 '19 at 07:00
2

It looks like you have elements with classes hdrlnk and result-price that come in pairs. So you need to iterate the lists in parallel with zip():

xs = driver.find_elements_by_class_name('hdrlnk')
ys = driver.find_elements_by_class_name('result-price')
for x, y in zip(xs, ys):
    print(x.text, y.text)

This assumes that the two lists contain the same number of elements in the correct order so that they match up correctly with zip(). It is probably safer to parse them directly from the HTML by iterating over the parent <p> elements:

ps = driver.find_elements_by_class_name('result-info')
for p in ps:
    x = p.find_element_by_class_name('hdrlnk')
    y = p.find_element_by_class_name('result-price')
    print(x.text, y.text)
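
The reason the per-parent loop is safer can be sketched without a browser. The example below uses xml.etree.ElementTree on made-up markup in which the second listing has no price; this is an assumption for illustration, not a claim about the actual page. With Selenium itself, find_element_by_class_name raises NoSuchElementException for a missing element, so a real script would need to handle that case as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the second listing has no price.
SAMPLE = """
<ul>
  <li><p class="result-info">
    <a class="result-title hdrlnk">ipad</a>
    <span class="result-meta"><span class="result-price">$425</span></span>
  </p></li>
  <li><p class="result-info">
    <a class="result-title hdrlnk">free couch</a>
    <span class="result-meta"></span>
  </p></li>
  <li><p class="result-info">
    <a class="result-title hdrlnk">lightning cable</a>
    <span class="result-meta"><span class="result-price">$100</span></span>
  </p></li>
</ul>
""".strip()

root = ET.fromstring(SAMPLE)

# Two independent page-wide lists, as in the zip() version.
titles = [a.text for a in root.iter("a")]
prices = [s.text for s in root.iter("span") if s.get("class") == "result-price"]

# zip() pairs by position, so the missing price shifts later prices up a row
# and the last title is dropped.
zipped = list(zip(titles, prices))

# Looking up the price inside each row keeps every title with its own price.
per_row = []
for p in root.iter("p"):
    title = p.find("a").text
    price = p.find(".//span[@class='result-price']")
    per_row.append((title, price.text if price is not None else None))

print(zipped)
print(per_row)
```

Here the zip() pairing attaches $100 to the wrong listing, while the per-row lookup reports the missing price explicitly.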
Code-Apprentice
  • I tried the first one and got me an errof of `'list' object has no attribute 'text'` the error is in the `print(x.text, y.text)`. I tried to modify it by `print(xs.text, ys.text)` And got an error of `"message": "Instance of 'tuple' has no 'text' member"` – draw134 Nov 08 '19 at 05:26
  • The second one got also an error of `'tuple' object has no attribute 'text'` And in my terminal I got output like this `[15768:7180:1108/132750.647:ERROR:page_load_metrics_update_dispatcher.cc(166)] Invalid first_paint 2.392 s for first_image_paint 2.388 s`. What should I do sir? – draw134 Nov 08 '19 at 05:28
  • @Vince You appear to be doing something slightly different than what I have here. I don't see how the code I gave can give any of those errors. – Code-Apprentice Nov 08 '19 at 16:06
  • @Vince If you need more help, post a new question with your current code and its errors. – Code-Apprentice Nov 08 '19 at 17:09
  • @Vince I found a mistake in my second example. I don't think the change affects the errors you are seeing, but it does fix a logic error. – Code-Apprentice Nov 08 '19 at 17:11
1

If your use case is to use find_elements_by_class_name(), a better approach would be to induce WebDriverWait for visibility_of_all_elements_located(), and you can use any of the following Locator Strategies:

  • Using CLASS_NAME:

    items = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "hdrlnk")))
    prices = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "result-price")))
    for i,j in zip(items, prices):
        print(i.text + j.text)
    

However, a more canonical approach would be to use either of the following:

  • CSS_SELECTOR:

    items = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "p.result-info a.hdrlnk")))
    prices = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "p.result-info span.result-meta>span.result-price")))
    for i,j in zip(items, prices):
        print(i.text + j.text)
    
  • XPATH:

    items = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//p[@class='result-info']//a[contains(@class, 'hdrlnk')]")))
    prices = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//p[@class='result-info']//span[@class='result-meta']/span[@class='result-price']")))
    for i,j in zip(items, prices):
        print(i.text + j.text)
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
Code-Apprentice
undetected Selenium
  • Got some error in importing sir. Like `unable to import selenium.webdriver.support` – draw134 Nov 08 '19 at 07:18
  • @Vince `import` should work irrespective of the underlying code block. Restart your _IDE_. If the error still persists you may have to reinstall _selenium_. – undetected Selenium Nov 08 '19 at 07:31