
I am getting the error AttributeError: 'NoneType' object has no attribute 'findAll' every time I run the Python script below. I have done some research and found a few posts suggesting that I am passing None when trying to find the images, which is why it errors, but I still have no solution. Any information is helpful.

Here is the full error:

Traceback (most recent call last):
  File "D:\Program Files\Parser Python\Test.py", line 33, in <module>
    for img in divImage.findAll('img'):
AttributeError: 'NoneType' object has no attribute 'findAll'


from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.common.exceptions import TimeoutException
import os

firefox_capabilities = DesiredCapabilities.FIREFOX
firefox_capabilities['marionette'] = True
firefox_capabilities['binary'] = r'C:\Program Files (x86)\Mozilla Firefox\firefox.exe'  # raw string: '\f' would otherwise be a form feed


os.environ["PATH"] += os.pathsep + r"C:\Python27\Lib\site-packages\selenium-2.53.6-py2.7.egg\selenium"
#binary = FirefoxBinary(r'C:\Program Files (x86)\Mozilla Firefox\firefox.exe')
driver = webdriver.Firefox(capabilities=firefox_capabilities)
# it takes forever to load the page, therefore we are setting a threshold
driver.set_page_load_timeout(5)

try:
    driver.get("http://readcomiconline.to/Comic/Flashpoint/Issue-1?id=19295&readType=1")
except TimeoutException:
    # never ignore exceptions silently in real world code
    pass

soup2 = BeautifulSoup(driver.page_source, 'html.parser')
divImage = soup2.find('div', {"id": "divImage"})
#divImage = soup2.find('div', {"id": "containerRoot"})

# close the browser 
driver.close()

for img in divImage.findAll('img'):
    print img.get('src')
Tenzin
Hunter Zolomon
    Does this topic help you out? Link: http://stackoverflow.com/questions/18065768/python-attributeerror-nonetype-object-has-no-attribute-findall – Tenzin Jan 11 '17 at 14:42
  • This might be relevant too http://stackoverflow.com/questions/31419641/python-scraper-unable-to-scrape-img-src. BTW why the Java tag? – doctorlove Jan 11 '17 at 14:43
  • Omg that was an accident. I originally was trying to accomplish this with JSOUP in java, so i guess i have java on my mind. My apologies. – Hunter Zolomon Jan 11 '17 at 14:44
  • Is there a way i can delete the tag? – Hunter Zolomon Jan 11 '17 at 14:45
  • you should be able to delete it when editing your question. Also, no problem if it was just an accident. But here on stackoverflow there are sometimes people who just add as many language tags to their question as possible in hopes of getting a quicker answer, hence my first comment. But as i said, if it was just an honest mistake then forget about it. Looks like someone already edited the question and its just waiting to get reviewed. – OH GOD SPIDERS Jan 11 '17 at 14:50
  • I am fairly new to this so I apologize. I just approved the edit. I will review the links and Answer – Hunter Zolomon Jan 11 '17 at 14:53
  • Thank you Tenzin and doctorlove. I had already looked at those links which gave me the idea of installing selenium. – Hunter Zolomon Jan 11 '17 at 15:13

1 Answer


The error means that divImage is None, which means that the div element with id="divImage" was not found in the parsed HTML.
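As a minimal, standalone illustration (using a made-up HTML snippet, not the actual page), find() returns None whenever no matching element exists, and guarding against that avoids the traceback:

```python
from bs4 import BeautifulSoup

# A made-up snippet that contains no element with id="divImage"
html = '<div id="other"><img src="a.png"/></div>'
soup = BeautifulSoup(html, 'html.parser')

div_image = soup.find('div', {"id": "divImage"})
print(div_image)  # None: no matching div in the document

# Guard before iterating so a missing element is handled explicitly
if div_image is None:
    srcs = []
else:
    srcs = [img.get('src') for img in div_image.findAll('img')]
print(srcs)  # []
```

In your script the guard only masks the symptom, though; the real fix is to make sure the element is actually present before parsing, as shown next.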

You should first wait for the desired element to become present on the page and only then get the page source and parse it. This can be done with WebDriverWait:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# ...

driver.get("http://readcomiconline.to/Comic/Flashpoint/Issue-1?id=19295&readType=1")

wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "divImage")))

soup2 = BeautifulSoup(driver.page_source, 'html.parser')

Note that to have all of the images loaded, you need to continuously scroll the page down to the footer until every image is present. One possible implementation:

driver.get("http://readcomiconline.to/Comic/Flashpoint/Issue-1?id=19295&readType=1")
wait.until(EC.presence_of_element_located((By.ID, "divImage")))

footer = driver.find_element_by_id("footer")

while True:
    # scroll to the footer
    driver.execute_script("arguments[0].scrollIntoView();", footer)
    time.sleep(0.5)

    # check if all images are loaded
    if all(img.get_attribute("src") for img in driver.find_elements_by_css_selector("#divImage p img")):
        break

Don't forget to import time.

alecxe
  • That seems to have done the trick. The page does take quite long to fully load as there are many images that are displayed. I am now able to print out the 'src' text, but I only get two lines for two images. Could this be because the other images weren't loaded at that time as well? – Hunter Zolomon Jan 11 '17 at 15:07
  • @HunterZolomon good point, this part is not as easy, updated the answer - check it out. – alecxe Jan 11 '17 at 15:40
  • Hmm now I am getting this error: selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: [id="footer"] I inspected the webpage and there certainly is an element with the ID of "footer". – Hunter Zolomon Jan 11 '17 at 15:56
  • @HunterZolomon ah, may be timing problem again, try waiting for it as well. Thanks. – alecxe Jan 11 '17 at 15:58
  • Yea i did a wait for the footer and it works now. Takes a few minutes because of the wait for the all the images. Thanks a lot for your help! Time to move on to the next step. – Hunter Zolomon Jan 11 '17 at 16:05