find aria-label in html page using soup python

Question

I have HTML pages that contain this code:

<span itemprop="title" data-andiallelmwithtext="15" aria-current="page" aria-label="you in page number 452">page 452</span>

I want to find the element by its aria-label, so I tried this:

is_452 = soup.find("span", {"aria-label": "you in page number 452"})
print(is_452)

I want to get the result:

is_452 = page 452

but I'm getting:

is_452 = None

How can I do this?

dabingsou · Answer 1 · 2020-01-15T02:17:19.473

1

The aria-label value has a line break in it, so an exact text match fails. Try the following:

from simplified_scrapy.simplified_doc import SimplifiedDoc

# Sample markup: note the line break inside the aria-label value
html = '''<span itemprop="title" data-andiallelmwithtext="15" aria-current="page" aria-label="you in page
number 452">page 452</span>'''
doc = SimplifiedDoc(html)
# \s* tolerates the line break between "page" and "number"
is_452 = doc.getElementByReg(r'aria-label="you in page\s*number 452"', tag="span")
print(is_452.text)

edited Jan 15 '20 at 02:17

answered Jan 10 '20 at 08:25

dabingsou


It doesn't work; I'm getting an exception every time. – Dvir Yadae Jan 14 '20 at 14:52
One version of the library has a problem. If the code ran for you before and stopped working after an update, you may have picked up the problematic version. I have modified the code above; alternatively, you can update the library. Please try again, and let me know if you have any questions. – dabingsou Jan 15 '20 at 02:20
What should be in html? I'm doing it like this: 'soup = BeautifulSoup(res.text, "html.parser")' and then 'oc = SimplifiedDoc(soup)' – Dvir Yadae Jan 15 '20 at 11:44
SimplifiedDoc takes only one parameter and does not depend on other libraries: SimplifiedDoc(res.text). Here's an example: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples – dabingsou Jan 16 '20 at 02:04

score 0 · Accepted Answer · answered Jan 10 '20 at 07:56

The desired element is possibly a dynamic element. In that case you can use Selenium to extract the value of the aria-label attribute, inducing WebDriverWait for visibility_of_element_located(), with either of the following locator strategies:

Using CSS_SELECTOR:

print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "section#header a.cart-heading[href='/cart']"))).get_attribute("aria-label"))

Using XPATH:

print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//section[@id='header']//a[@class='cart-heading' and @href='/cart']"))).get_attribute("aria-label"))

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

I can't work with a driver. Can I get this data by making a request to the URL? – Dvir Yadae, Jan 14 '20 at 14:53

score 0 · Answer 3 · answered May 26 '21 at 06:54

soup.find fails here because of the line break in the attribute value. I have a simpler solution that doesn't need any separate library, just BeautifulSoup. I know this question is old, but it has 1k views, so clearly many people search for it. You can use a triple-quoted string to account for the newline. This:

is_452 = soup.find("span", {"aria-label": "you in page number 452"})
print(is_452)

Would become:

search_label = """you in page
number 452"""
is_452 = soup.find("span", {"aria-label": search_label})
print(is_452)
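If you don't know exactly where the line break falls, a whitespace-tolerant match may be more robust than hard-coding the newline. The following is only a minimal sketch. It assumes the markup looks like the sample in the question with a line break inside the attribute value, as the answers describe. The html string and the names wanted and by_function are made up for illustration. BeautifulSoup accepts a compiled regular expression or a function as an attribute filter, so either can absorb the extra whitespace:

```python
import re

from bs4 import BeautifulSoup

# Assumed sample markup: same span as in the question, with a line break
# inside the aria-label value.
html = '''<span itemprop="title" data-andiallelmwithtext="15" aria-current="page" aria-label="you in page
number 452">page 452</span>'''

soup = BeautifulSoup(html, "html.parser")

# Option 1: a regex where \s+ matches any run of whitespace, newlines included.
is_452 = soup.find("span", attrs={"aria-label": re.compile(r"you in page\s+number 452")})

# Option 2: a function that collapses whitespace before comparing.
wanted = "you in page number 452"
by_function = soup.find(
    "span",
    attrs={"aria-label": lambda value: value is not None and " ".join(value.split()) == wanted},
)

# find() returns the whole tag; get_text() returns only its text content.
print(is_452.get_text())
print(by_function.get_text())
```

Note that print(is_452) would print the whole tag, so call get_text() if only the visible text "page 452" is wanted.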
