Bypass Referral Denied error in selenium using python

Question

I was making a script to download images from comic.naver.com and I'm nearly done with it, but I can't save the images. I successfully grabbed the image links via urllib and BeautifulSoup, but it seems they've introduced hotlink blocking, and I can't save the images to my system via urllib or Selenium.

Update: I tried changing the user agent to see if that was causing the problem, but the result is still the same.

Any fix or solution?

My code right now :

import requests
from bs4 import BeautifulSoup
import re
import urllib
import urllib2
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
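from selenium.common.exceptions import TimeoutException
# Needed for the DesiredCapabilities.PHANTOMJS lookup below
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities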


dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Chrome/15.0.87"
)

url = "http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue"
driver = webdriver.PhantomJS(desired_capabilities=dcap)

soup = BeautifulSoup(urllib.urlopen(url).read())
scripts = soup.findAll('img', alt='comic content')

for links in scripts:
    Imagelinks = links['src']
    filename = Imagelinks.split('_')[-1]
    print 'Downloading Image : '+filename
    driver.get(Imagelinks)
    driver.save_screenshot(filename)


driver.close()

Following 'MAI's' reply, I tried what I could with selenium, and got what I wanted. It's solved now. My code :

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains



driver = webdriver.Chrome()
url = "http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue"
driver.get(url)

elem = driver.find_elements_by_xpath("//div[@class='wt_viewer']//img[@alt='comic content']")

for links in elem:
    print links.get_attribute('src')


driver.quit()

But when I try to take screenshots of these elements, it says that the "element is not attached to the page". Now, how am I supposed to solve that? :/

raehik · Answer 1 · 2016-08-30T13:30:30.613

(Note: Apologies, I'm not able to comment, so I have to make this an answer.)

To answer your original question, I've just been able to download an image in cURL from Naver Webtoons (the English site) by adding a Referer: http://www.webtoons.com header like so:

curl -H "Referer: http://www.webtoons.com" [link to image] > img.jpg

I haven't tried, but you'll probably want to use http://comic.naver.com instead. To do this with urllib, create a Request object with the header required:

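req = urllib.request.Request(url, headers={"Referer": "http://comic.naver.com"})
with urllib.request.urlopen(req) as response, open("image.jpg", "wb") as outfile:
    # Stream the response body straight into the output file (needs "import shutil")
    shutil.copyfileobj(response, outfile)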

Then you can save the file using shutil.copyfileobj(src, dest). So instead of taking screenshots, you can simply get a list of all the images to download, then make a request for each one using the referer header.
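Putting those pieces together, a minimal Python 3 sketch using only the standard library might look like the following. The helper names (`build_image_request`, `filename_from_url`, `download_image`) are made up for illustration, and the `http://comic.naver.com` referer is the untested guess from above, so treat it as an assumption rather than something known to work.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlsplit

# Assumption: the image host accepts requests whose Referer is the comic site.
REFERER = "http://comic.naver.com"


def build_image_request(image_url, referer=REFERER):
    """Return a Request carrying the Referer header (hypothetical helper)."""
    return urllib.request.Request(image_url, headers={"Referer": referer})


def filename_from_url(image_url):
    """Use the last path component of the URL as the local file name."""
    return os.path.basename(urlsplit(image_url).path)


def download_image(image_url, dest_dir="."):
    """Stream one image to disk and return the path it was written to."""
    dest = os.path.join(dest_dir, filename_from_url(image_url))
    with urllib.request.urlopen(build_image_request(image_url)) as response, \
            open(dest, "wb") as outfile:
        shutil.copyfileobj(response, outfile)
    return dest
```

You would then call `download_image(src)` once for each image link you collected from the page.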

Edit: I have a working script on GitHub which only requires urllib and BeautifulSoup.

score 1 · Accepted Answer · answered Mar 18 '16 at 04:32


I took a short look at the website with Chrome dev tools.

I would suggest downloading the images directly instead of taking screenshots. Selenium WebDriver should run the page's JavaScript in the PhantomJS headless browser, so you should find the JavaScript-loaded images at the following path.

The path I get from eyeballing the HTML is

html body #wrap #container #content div #comic_view_area div img

The image tags at the last level have IDs like content_image_N, with N counting from 0. So you can also get a specific picture by using img#content_image_0, for example.
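As a rough illustration of picking the images out by ID, here is a sketch that scans rendered HTML for `img` tags whose `id` starts with `content_image_`. It uses the standard library's `html.parser` rather than Selenium or BeautifulSoup purely to stay dependency-free; the class and function names are hypothetical, and the assumption is that you pass in the page source after JavaScript has run (for example `driver.page_source`).

```python
from html.parser import HTMLParser


class ContentImageParser(HTMLParser):
    """Collect the src of every <img> whose id looks like content_image_N."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        image_id = attributes.get("id") or ""
        if image_id.startswith("content_image_") and attributes.get("src"):
            self.sources.append(attributes["src"])


def content_image_sources(html):
    """Return the image URLs in document order (hypothetical helper)."""
    parser = ContentImageParser()
    parser.feed(html)
    parser.close()
    return parser.sources
```

With BeautifulSoup, the CSS selector `#comic_view_area img` passed to `soup.select(...)` should give a similar result, assuming the page structure is as described above.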

answered Mar 18 '16 at 04:32

Patrick the Cat


I saw that as well. I'm not particularly good with Selenium. The code I posted gives me the links, which is not a problem, but I can't seem to use the links to save the images, so I tried the screenshot approach. When I open the image links from PhantomJS or Firefox, I get that referral denied error, but in Chrome it works. I will try the same thing with the Chrome driver and post the results. – Xonshiz Mar 18 '16 at 06:07
@Xonshiz you need to keep session cookies I think. http://stackoverflow.com/questions/15058462/how-to-save-and-load-cookies-using-python-selenium-webdriver – Patrick the Cat Mar 18 '16 at 15:45
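If session cookies do turn out to matter (nothing in this thread confirms that they do for this site), one possible sketch is to copy the cookies out of the browser and send them with the image request. Selenium's `driver.get_cookies()` returns a list of dicts with `name` and `value` keys; the two helper functions below are hypothetical, and the default referer is the same untested assumption as in the other answer.

```python
import urllib.request


def cookie_header(cookies):
    """Join Selenium-style cookie dicts into a single Cookie header value."""
    return "; ".join("{}={}".format(c["name"], c["value"]) for c in cookies)


def request_with_session(image_url, cookies, referer="http://comic.naver.com"):
    """Build a request that reuses the browser's cookies and sends a Referer."""
    headers = {"Referer": referer}
    if cookies:
        headers["Cookie"] = cookie_header(cookies)
    return urllib.request.Request(image_url, headers=headers)
```

After loading the comic page in the browser, `request_with_session(src, driver.get_cookies())` would give a request you can pass to `urllib.request.urlopen`.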
