Why can some image URLs only be displayed inside HTML, but not opened directly by URL in the browser?

Question

Recently, I have been trying to download some images from a website. I found the displayed image element in the HTML and opened the image URL in a new tab, but it returned a 403 Forbidden page. When I copied the URL into another page's HTML, the image displayed successfully. What is the reason for this, and what can I do to download the image? (I am trying to download it with Python's requests.get().) Thank you.

That's quite strange. Since you're saying the image doesn't show up if you copy the URL in a new tab, it's not a `User-Agent` issue, and the image successfully loads when inserted in another html page, it's probably not a `Referer` issue. Can you post links to both the image and the page the image was on? — GordonAitchJay, Mar 12 '20 at 16:08
https://tw.manhuagui.com/comic/35275/481200.html This is the link of a comic website, and the image is actually that comic page. — HA HA chan, Mar 12 '20 at 17:27

score 0 · Answer 1 · answered Mar 12 '20 at 15:33

Some websites block requests that lack a User-Agent header. Try this:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
response = requests.get(url, headers=headers)  # url: the image URL you want to fetch

Reference: Python requests. 403 Forbidden

damaredayo

Thank you for your answer. I had already tried it, and it still returns 403 Forbidden. Actually, this problem also appears when I manually open the image URL in my browser. – HA HA chan Mar 12 '20 at 17:25

score 0 · Accepted Answer · answered Mar 13 '20 at 06:59

This web server checks the Referer header when you request the image. To successfully download the image, the Referer must be the page the image is on. It doesn't care about the User-Agent. I assume the image showed up when you put it in another page because your browser cached the image, and did not actually request it from the server again.
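If the image URL is already known (for example, copied from the browser's network monitor), sending the comic page's address as the Referer may be all that is needed. The following is a minimal sketch using only the standard library, not a tested recipe for this particular site. `ascii_url`, `build_image_request`, and `download_image` are hypothetical helper names. The percent-encoding step is an assumption on my part: it is there because `urllib.request` cannot send a URL whose path contains non-ASCII characters.

```python
import urllib.request
from urllib.parse import quote, urlsplit, urlunsplit


def ascii_url(url):
    """Percent-encode non-ASCII characters in the path; leave the rest untouched."""
    parts = urlsplit(url)
    # "%" is kept safe so an already-encoded path is not encoded twice.
    return urlunsplit(parts._replace(path=quote(parts.path, safe="/%")))


def build_image_request(image_url, page_url):
    """Build a GET request for image_url that names page_url as the Referer."""
    headers = {
        # The header the server is reported to check.
        "Referer": page_url,
        # Reported as not required by this server; included to resemble a browser.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) "
                      "Gecko/20100101 Firefox/73.0",
    }
    return urllib.request.Request(ascii_url(image_url), headers=headers)


def download_image(image_url, page_url, filename):
    """Fetch the image and write the raw bytes to filename; return the byte count."""
    request = build_image_request(image_url, page_url)
    with urllib.request.urlopen(request) as response:
        data = response.read()
    with open(filename, "wb") as f:
        f.write(data)
    return len(data)
```

With requests, which the question uses, the equivalent call is `requests.get(image_url, headers={"Referer": page_url})`.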

By using your browser's network monitor tool, you can see how your browser got the image's URL. In this case, the URL wasn't a part of the original html document. Your browser executed some JavaScript that unpacked the URL and inserted an img element into the div element with id="mangaBox". Because of this, you can't use vanilla requests, as it doesn't execute JavaScript. I used Requests-HTML.

The code below downloads the image from the link you gave in your comment, and saves it to disk:

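import os
import urllib.parse  # import the submodule explicitly; "import urllib" alone does not load it

from requests_html import HTMLSession

# Every request made through this session sends the comic page as the Referer.
session = HTMLSession()
session.headers.update({"User-Agent": r"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0",
                        "Referer": r"https://tw.manhuagui.com/comic/35275/481200.html"
                        })

url = r"https://tw.manhuagui.com/comic/35275/481200.html"

# Fetch the comic page itself.
response = session.get(url)
print(response, len(response.content))

# Execute the page's JavaScript so the img element is inserted into the document.
response.html.render()

img = response.html.find("img#mangaFile", first=True)
print("img element:", img)

# The rendered img element carries the real image URL.
url = img.attrs["src"]
print("image url:", url)

# Request the image; the session still sends the Referer header.
response = session.get(url)
print(response, len(response.content))

# Use the last path component of the URL (without the query string) as the file name.
filename = os.path.basename(urllib.parse.urlsplit(url).path)
print("filename:", filename)

with open(filename, "wb") as f:
    f.write(response.content)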

Output:

<Response [200]> 6715
img element: <Element 'img' alt='在地下城寻找邂逅难道有错吗？ 第00话' id='mangaFile' src='https://i.hamreus.com/ps3/z/zdxcxzxhndyc_sddc/第00话/P0018.jpg.webp?cid=481200&md5=aAAP75PBy9DIa0bb8Hlwfw' class=('mangaFile',) data-tag='mangaFile' style='display: block; transform: rotate(0deg); transform-origin: 50% 50% 0px;' imgw='907'>
image url: https://i.hamreus.com/ps3/z/zdxcxzxhndyc_sddc/第00话/P0018.jpg.webp?cid=481200&md5=aAAP75PBy9DIa0bb8Hlwfw
<Response [200]> 186386
filename: P0018.jpg.webp
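
The file name in the output comes from taking the last path component of the image URL and ignoring the query string. As a small standalone check of that step, here is a sketch that needs no network access. `filename_from_url` is a hypothetical helper name, and `posixpath` is used in place of `os.path` only so that the split is on "/" on every platform.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path component of url, without query string or fragment."""
    return posixpath.basename(urlsplit(url).path)


image_url = ("https://i.hamreus.com/ps3/z/zdxcxzxhndyc_sddc/第00话/P0018.jpg.webp"
             "?cid=481200&md5=aAAP75PBy9DIa0bb8Hlwfw")
print(filename_from_url(image_url))  # prints: P0018.jpg.webp
```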

For what it's worth, a whole heap of image URLs, in addition to the main image of the current page, are packed in the last script element of the original html document.

<script type="text/javascript">window["\x65\x76\x61\x6c"](function(p,a,c,k,e,d)...
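
To look at that packed script without rendering the page, one option is to pull the text of the last script element out of the raw HTML. The sketch below uses only the standard library. `ScriptCollector` and `last_script_text` are hypothetical names, and the sample HTML is made up for illustration. It does not attempt to unpack the JavaScript itself; it only isolates the text you would then need to decode.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element, in document order."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append rather than assign.
        if self._in_script:
            self.scripts[-1] += data


def last_script_text(html):
    """Return the text of the last <script> element, or None if there is none."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return collector.scripts[-1] if collector.scripts else None


# Made-up stand-in for the original HTML document.
sample = ('<html><body><script src="a.js"></script><p>x</p>'
          '<script type="text/javascript">window["eval"](1)</script></body></html>')
print(last_script_text(sample))
```

In practice the input would be the HTML text of the comic page, for example `response.text` from the first request.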

It works! Thank you for your answer. I decided to use requests with the 'Referer' header because I got an error when running response.html.render(). Anyway, you figured out the problem, and that's enough for me. — HA HA chan, Mar 13 '20 at 13:32
Great! Don't forget to accept and/or vote up any helpful answers, as per [What should I do when someone answers my question?](https://stackoverflow.com/help/someone-answers). By the way, was the error something like `pyppeteer.errors.NetworkError` or `This event loop is already running`? — GordonAitchJay, Mar 13 '20 at 13:40
