Python Scraper Unable to scrape img src

Question

I'm unable to scrape images from the website www.kissmanga.com. I'm using Python 3 with the Requests and BeautifulSoup libraries. The scraped image tags come back with a blank "src".

SRC:

from bs4 import BeautifulSoup
import cfscrape
import requests

# Second attempt: swap requests.get(url) below for scraper.get(url)
scraper = cfscrape.create_scraper()

url = "http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206"

response = requests.get(url)

soup2 = BeautifulSoup(response.text, 'html.parser')

divImage = soup2.find('div', {"id": "divImage"})

# Each <img> tag found here prints with an empty src
for img in divImage.findAll('img'):
    print(img)

response.close()

I suspect the scraping is being blocked because the website uses Cloudflare. On that assumption, I also tried the "cfscrape" library to fetch the content.

I recently noticed that the images are loaded via JavaScript, so I just parsed the JavaScript that contained the code. – ibz, Aug 08 '15 at 01:46
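
The approach described in the comment above, pulling the image URLs straight out of the page's JavaScript, can be sketched roughly as follows. This is only an illustration: it assumes each URL appears as a quoted string inside a call such as `lstImages.push("...")`. The name `lstImages`, the sample markup, and the helper `extract_image_urls` are hypothetical and not taken from the question, so the pattern would need adjusting to whatever the real script looks like.

```python
import re

# Hypothetical pattern: assumes each image URL is passed as a quoted string
# to a call named lstImages.push(...). Adjust it to match the real page script.
PUSH_CALL = re.compile(r'lstImages\.push\(\s*["\']([^"\']+)["\']\s*\)')


def extract_image_urls(page_source):
    """Return every URL found in lstImages.push("...") calls, in page order."""
    return PUSH_CALL.findall(page_source)


# Made-up sample standing in for response.text from the question's code.
SAMPLE_HTML = """
<div id="divImage"></div>
<script type="text/javascript">
    var lstImages = new Array();
    lstImages.push("http://example.com/bleach/634/001.jpg");
    lstImages.push('http://example.com/bleach/634/002.jpg');
</script>
"""

for url in extract_image_urls(SAMPLE_HTML):
    print(url)
```

Because this works on the raw response text, it needs no browser, but it breaks as soon as the site changes how the script builds its image list.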

score 3 · Accepted Answer · edited May 23 '17 at 12:14

You need to wait for JavaScript to inject the HTML for the images.

Multiple tools are capable of doing this. I was able to get it working with Selenium:

from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()
# it takes forever to load the page, therefore we are setting a threshold
driver.set_page_load_timeout(5)

try:
    driver.get("http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206")
except TimeoutException:
    # never ignore exceptions silently in real world code
    pass

soup2 = BeautifulSoup(driver.page_source, 'html.parser')
divImage = soup2.find('div', {"id": "divImage"})

# close the browser 
driver.close()

for img in divImage.findAll('img'):
    print(img.get('src'))

Refer to "How to download image using requests" if you also want to download these images.
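
For the download step, a minimal sketch might look like the following. It is not the code from the linked question. The helpers `filename_from_url` and `download_image` are made-up names, and `session` is assumed to behave like a `requests.Session` (a `get` method accepting `stream=True`, returning a response with `raise_for_status` and `iter_content`).

```python
import os
from urllib.parse import urlsplit


def filename_from_url(url):
    """Use the last path segment of the URL as the local file name."""
    name = os.path.basename(urlsplit(url).path)
    # Fall back to a generic name when the URL path ends in a slash.
    return name or "image"


def download_image(session, url, dest_dir):
    """Stream one image into dest_dir and return the path that was written.

    `session` is assumed to behave like requests.Session.
    """
    path = os.path.join(dest_dir, filename_from_url(url))
    response = session.get(url, stream=True)
    response.raise_for_status()  # stop on 4xx/5xx instead of saving an error page
    with open(path, "wb") as handle:
        for chunk in response.iter_content(chunk_size=8192):
            handle.write(chunk)
    return path
```

With Requests installed, each `src` printed by the loop above could be passed in as `download_image(requests.Session(), src, "pages")`, where `"pages"` is a placeholder for an existing directory.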

edited May 23 '17 at 12:14

Community


answered Jul 15 '15 at 12:07

Dušan Maďar


Is there a way to do this without opening the browser? BTW, your solution works well. Thank you. – ibz Jul 15 '15 at 20:17
Well, I am not sure; maybe with a custom user-agent, as mentioned by @Kupiakos. If the only problem with the Selenium solution is that it actually opens a browser window, you can use a headless browser like `PhantomJS`. – Dušan Maďar Jul 15 '15 at 20:22
Take a look at this: http://stackoverflow.com/questions/6025082/headless-browser-for-python-javascript-support-required – Dušan Maďar Jul 15 '15 at 20:24
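
The headless idea raised in these comments can be sketched with Selenium alone. PhantomJS is no longer maintained and Selenium 4 no longer ships a driver for it, so this sketch swaps in Firefox's own headless mode instead. It assumes Selenium 4 and geckodriver are installed. The helper names `make_headless_firefox` and `rendered_page_source` are illustrative.

```python
def make_headless_firefox():
    """Start Firefox without a visible window (assumes Selenium 4 + geckodriver)."""
    # Imported lazily so the helper below can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


def rendered_page_source(url, make_driver=make_headless_firefox):
    """Load url in a browser and return its HTML, always quitting the browser."""
    driver = make_driver()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Runs even if get() raises, so no browser process is left behind.
        driver.quit()
```

The returned string can be handed to `BeautifulSoup(html, 'html.parser')` exactly as `driver.page_source` is in the accepted answer. The page-load timeout used there is left out here for brevity and may still be needed for slow pages.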

score 0 · Answer 2 · answered Jul 15 '15 at 16:42

Have you tried setting a custom user-agent? It's typically considered unethical to do so, but so is scraping manga.
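
For reference, setting a custom user-agent with a Requests-style session could look roughly like this. The header value and the helper name `get_with_user_agent` are made up for illustration, and `session` is assumed to behave like `requests.Session`. Nothing in this thread confirms that it fixes the blank `src`, which the accepted answer attributes to JavaScript.

```python
# Made-up browser-like value; any User-Agent string can be substituted.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"
)


def get_with_user_agent(session, url, user_agent=BROWSER_USER_AGENT):
    """Issue a GET through a requests-style session with a custom User-Agent."""
    return session.get(url, headers={"User-Agent": user_agent})
```

With Requests installed, `get_with_user_agent(requests.Session(), url).text` would stand in for `response.text` in the question's code.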

answered Jul 15 '15 at 16:42

Alyssa Haroldsen

