I have HTML pages containing this code:

<span itemprop="title" data-andiallelmwithtext="15" aria-current="page" aria-label="you in page number 452">page 452</span>

I want to find the element by its aria-label, so I tried this:

is_452 = soup.find("span", {"aria-label": "you in page number 452"})
print(is_452)

I want to get this result:

is_452 = page 452

but I'm getting:

is_452 = None

How can I do this?

Dvir Yadae

3 Answers

The attribute value has a line break in it, so it doesn't match the single-line search text. Try the following:

from simplified_scrapy.simplified_doc import SimplifiedDoc

html = '''<span itemprop="title" data-andiallelmwithtext="15" aria-current="page" aria-label="you in page
number 452">page 452</span>'''
doc = SimplifiedDoc(html)
# Raw string so \s reaches the regex engine unchanged; \s* also matches the line break.
is_452 = doc.getElementByReg(r'aria-label="you in page\s*number 452"', tag="span")
print(is_452.text)
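
The same idea can be sketched with BeautifulSoup alone, since `find()` accepts a compiled regular expression as an attribute value. This assumes the attribute value really does contain a line break, as described above; the variable names `label_pattern` and `page_text` are only illustrative.

```python
import re
from bs4 import BeautifulSoup

# Sample markup, assuming the aria-label value is split across two lines.
html = '''<span itemprop="title" data-andiallelmwithtext="15" aria-current="page" aria-label="you in page
number 452">page 452</span>'''

soup = BeautifulSoup(html, "html.parser")

# \s+ matches any run of whitespace, so a space or a line break both match.
label_pattern = re.compile(r"you in page\s+number 452")
is_452 = soup.find("span", attrs={"aria-label": label_pattern})

# find() returns the whole tag (or None); get_text() extracts the inner text.
page_text = is_452.get_text() if is_452 is not None else None
print(page_text)
```

Note that printing `is_452` directly would show the whole `<span>` tag; `get_text()` is what yields just `page 452`.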
dabingsou
  • It doesn't work; I keep getting an exception. – Dvir Yadae Jan 14 '20 at 14:52
  • One version of the library has a problem. If the code ran before and broke after an update, you may have the problematic version. I have modified the code above, or you can update the library. Please try again, and let me know if you have any questions. – dabingsou Jan 15 '20 at 02:20
  • What should be in html? I'm doing it like this: 'soup = BeautifulSoup(res.text, "html.parser")' and then 'oc = SimplifiedDoc(soup)' – Dvir Yadae Jan 15 '20 at 11:44
  • SimplifiedDoc takes only one parameter and does not depend on other libraries, so pass it the raw HTML: SimplifiedDoc(res.text). Here's an example: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples – dabingsou Jan 16 '20 at 02:04

The desired element may be rendered dynamically. In that case you can use Selenium to extract the value of the aria-label attribute, inducing WebDriverWait for visibility_of_element_located(). You can use either of the following locator strategies, adapting the selector to your own page:

  • Using CSS_SELECTOR:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "section#header a.cart-heading[href='/cart']"))).get_attribute("aria-label"))
    
  • Using XPATH:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//section[@id='header']//a[@class='cart-heading' and @href='/cart']"))).get_attribute("aria-label"))
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
undetected Selenium

BeautifulSoup fails here because of the line break in the attribute value. I have a simpler solution that doesn't need any separate library, just BeautifulSoup. I know this question is old, but it has 1k views, so clearly many people search for it. You can use a triple-quoted string so that the search value includes the newline. This:

is_452 = soup.find("span", {"aria-label": "you in page number 452"})
print(is_452)

Would become:

search_label = """you in page
number 452"""
is_452 = soup.find("span", {"aria-label": search_label})
print(is_452)
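
If you don't know exactly where the line break falls, a variation is to pass a function as the attribute filter and compare after collapsing whitespace. This is only a sketch: `has_label` is a hypothetical helper name, and the sample markup assumes the label is split across two lines.

```python
from bs4 import BeautifulSoup

# Sample markup, assuming the aria-label value is split across two lines.
html = '''<span itemprop="title" data-andiallelmwithtext="15" aria-current="page" aria-label="you in page
number 452">page 452</span>'''

soup = BeautifulSoup(html, "html.parser")


def has_label(expected):
    """Build a filter that compares attribute values after collapsing whitespace."""
    def matcher(value):
        # value is None for tags that lack the attribute entirely.
        return value is not None and " ".join(value.split()) == expected
    return matcher


is_452 = soup.find("span", attrs={"aria-label": has_label("you in page number 452")})

# get_text() returns the inner text instead of the whole tag.
page_text = is_452.get_text() if is_452 is not None else None
print(page_text)
```

This way the search string stays on one line, and it matches whether the page uses a space, a newline, or several of either.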
WhiteWood