How to extract the title and src of an image with BeautifulSoup or Selenium?

Question

I have the full page content from:

content = driver.page_source
soup = BeautifulSoup(content, features="html.parser")

Then I did this:

idioma = soup.select(".idioma > span:nth-child(1)")

Which gave me this:

[<span>
<img alt="Idioma Aleman" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ale.png" title="Idioma Aleman"/>
<img alt="Idioma Chino-tradicional" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/chi.png" title="Idioma Chino-tradicional"/>
<img alt="Idioma Coreano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/cor.png" title="Idioma Coreano"/>
<img alt="Idioma Español" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español"/>
<img alt="Idioma Español-latino" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español-latino"/>
<img alt="Idioma Frances" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/fra.png" title="Idioma Frances"/>
<img alt="Idioma Ingles" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ing.png" title="Idioma Ingles"/>
<img alt="Idioma Italiano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ita.png" title="Idioma Italiano"/>
<img alt="Idioma Portugues" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/por.png" title="Idioma Portugues"/>
<img alt="Idioma Ruso" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/rus.png" title="Idioma Ruso"/>
</span>]

When I do this to obtain the titles:

idioma = [''.join(elem.find('img')['title']) for elem in idioma if elem]

I only get the first one:

['Idioma Aleman']

Why am I not getting all of them?

HedgeHog · Answer 1 · 2020-12-18T07:38:32.170

Why are you not getting all the titles?

It is because idioma contains only one element (the <span>), and find() returns only the first match inside it.

What you can do is something like this:

idioma = [''.join(elem['title']) for elem in idioma.findAll('img')]
print (idioma)

Output

['Idioma Aleman', 'Idioma Chino-tradicional', 'Idioma Coreano', 'Idioma Español', 'Idioma Español-latino', 'Idioma Frances', 'Idioma Ingles', 'Idioma Italiano', 'Idioma Portugues', 'Idioma Ruso']

Working example, added in response to a comment

import bs4

content ='''<span>
<img alt="Idioma Aleman" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ale.png" title="Idioma Aleman"/>
<img alt="Idioma Chino-tradicional" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/chi.png" title="Idioma Chino-tradicional"/>
<img alt="Idioma Coreano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/cor.png" title="Idioma Coreano"/>
<img alt="Idioma Español" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español"/>
<img alt="Idioma Español-latino" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español-latino"/>
<img alt="Idioma Frances" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/fra.png" title="Idioma Frances"/>
<img alt="Idioma Ingles" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ing.png" title="Idioma Ingles"/>
<img alt="Idioma Italiano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ita.png" title="Idioma Italiano"/>
<img alt="Idioma Portugues" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/por.png" title="Idioma Portugues"/>
<img alt="Idioma Ruso" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/rus.png" title="Idioma Ruso"/>
</span>'''

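# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = bs4.BeautifulSoup(content, "html.parser")

The following makes the difference: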

idiomaSpan = soup.select_one('span')

idioma = [''.join(elem['title']) for elem in idiomaSpan.find_all('img')]
print (idioma)
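
Since the question also asks for src, here is a minimal sketch that collects both attributes in one pass. It reuses two of the img tags from the example; the variable names html and flags are only illustrative, and .get() is used so a missing attribute yields None instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Two of the <img> tags from the example above, kept short for readability
html = '''<span>
<img alt="Idioma Aleman" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ale.png" title="Idioma Aleman"/>
<img alt="Idioma Ingles" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ing.png" title="Idioma Ingles"/>
</span>'''

soup = BeautifulSoup(html, "html.parser")

# One (title, src) tuple per <img> found inside a <span>
flags = [(img.get("title"), img.get("src")) for img in soup.select("span img")]

for title, src in flags:
    print(title, src)
```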

it doesn't work, this is the console error: ```AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?``` — regex, Dec 18 '20 at 00:59
That is caused by `select()`, which finds multiple instances and returns a list, so you have to iterate over it as well. I think you can use `select_one()` instead, which returns only the first occurrence, equivalent to `find()`. I added an example to my answer. — HedgeHog, Dec 18 '20 at 07:48
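
To illustrate the point made in the comment: select() returns a list-like ResultSet, so you can either loop over it or put the img into the selector itself. This is only a sketch. It assumes the span sits inside an element with class idioma, as the selector in the question suggests; the wrapping div and the shortened src values are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the wrapping div and the short src values are assumptions
html = """
<div class="idioma">
  <span>
    <img class="post_flagen" src="flags/ale.png" title="Idioma Aleman"/>
    <img class="post_flagen" src="flags/ing.png" title="Idioma Ingles"/>
  </span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: select() gives a ResultSet, so iterate over the spans,
# then over every <img> inside each span
spans = soup.select(".idioma > span:nth-child(1)")
titles_nested = [img["title"] for span in spans for img in span.find_all("img")]

# Option 2: let the CSS selector match the <img> elements directly
titles_direct = [img["title"] for img in soup.select(".idioma > span:nth-child(1) img")]

print(titles_nested)  # ['Idioma Aleman', 'Idioma Ingles']
print(titles_direct)  # ['Idioma Aleman', 'Idioma Ingles']
```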

score -1 · Answer 2 · answered Dec 16 '20 at 08:51

To extract the title and src attributes from all of the <img> elements inside the <span> using Selenium and Python, wait for the elements with WebDriverWait and visibility_of_all_elements_located(). You can use either of the following locator strategies:

Using CSS_SELECTOR for title:

print([my_elem.get_attribute("title") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".idioma > span:nth-child(1) img.post_flagen[alt^='Idioma']")))])

Using XPATH for src:

print([my_elem.get_attribute("src") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[contains(@class, 'idioma')]//span//img[starts-with(@alt, 'Idioma') and @class='post_flagen']")))])

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
