Hi, I'm trying to build a manga downloader app, so I'm scraping several sites. However, I run into a problem once I have the image URL: I can view the image in my browser (Chrome) and even download it from there, but I can't do the same with any popular scripting library.

Here is what I've tried:

String imgSrc = "https://cdn.mangaeden.com/mangasimg/aa/aa75d306397d1d11d07d66746dae78a36dc78672ae9e97a08cb7abb4.jpg";
Connection.Response resultImageResponse = Jsoup.connect(imgSrc)
                    .userAgent(
                            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                    .referrer("none").execute();

// output here
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(new java.io.File(String.valueOf(imgPath))));
out.write(resultImageResponse.body());          // resultImageResponse.body() is where the image's contents are.
out.close();

I've also tried this:

URL imgUrl = new URL(imgSrc);
Files.copy(imgUrl.openStream(), imgPath);

Lastly, since I was sure the link works, I tried to download the image using Python, but in this case too I get a 403 error:

import requests
url = "https://cdn.mangaeden.com/mangasimg/d0/d08f07d762acda8a1f004677ab2414b9766a616e20bd92de4e2e44f1.jpg"
res = requests.get(url)

While googling I found the question "Unable to get image url in Mangaeden API Angular 6", which seems really close to my problem. However, I can't tell whether I'm setting the referrer incorrectly or whether that approach doesn't work at all...

Do you have any tips? Thank you!

Stefano
  • `curl.exe "https://cdn.mangaeden.com/mangasimg/d0/d08f07d762acda8a1f004677ab2414b9766a616e20bd92de4e2e44f1.jpg"` gives error code: 1020 (access denied by cloudflare), so probably some caching or cookie token protection in place – MortenB Dec 27 '21 at 20:30
  • Pasting the URL directly into the browser gives a 403 as well (both using Chrome and using Postman). – BrokenBenchmark Dec 27 '21 at 20:32
  • Well, I think it's normal that Postman/curl don't work; with the same configuration they behave exactly like the requests library. My question is: why can the browser display the image? Does it use a different configuration? @BrokenBenchmark – Stefano Dec 27 '21 at 20:35
  • Sorry, I should have clarified that I used both Chrome and Postman. – BrokenBenchmark Dec 27 '21 at 20:36
  • Oh... That was unexpected, so why am I seeing this image? I've tried opening the link with different browsers and devices and it works perfectly, e.g. I sent the same link to my phone and then clicked it – Stefano Dec 27 '21 at 20:40

2 Answers

How to fix?

Add headers to your request so that it looks like it comes from a browser, in particular a browser-like User-Agent. The server then responds with 200 and you can save the file.

Note: This also works in Postman; just override its default User-Agent header and you will get the image as the response.

Example (python)

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
url = "https://cdn.mangaeden.com/mangasimg/d0/d08f07d762acda8a1f004677ab2414b9766a616e20bd92de4e2e44f1.jpg"
res = requests.get(url, headers=headers)
res.raise_for_status()  # fail loudly on 403/503 instead of saving an error page
# write the raw bytes; the image must be opened in binary mode
with open("image.jpg", 'wb') as f:
    f.write(res.content)
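
If you would rather not depend on the requests package, the same idea can probably be sketched with the standard library alone. This is only a sketch under assumptions: the helper names `build_request` and `download` are made up for illustration, the User-Agent string is the one from the example above, and whether the CDN keeps accepting it is not guaranteed.

```python
import urllib.request

# Browser-like User-Agent taken from the example above (assumption: the
# server only checks that the header looks like a real browser).
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
)


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})


def download(url: str, path: str) -> None:
    """Fetch url and write the raw bytes to path.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so an
    error page is not silently saved as an image.
    """
    with urllib.request.urlopen(build_request(url), timeout=30) as resp:
        data = resp.read()
    with open(path, "wb") as f:
        f.write(data)
```

Calling `download(url, "image.jpg")` with the `url` from the example should then behave like the requests version.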
HedgeHog

Someone wrote this answer, but later deleted it, so I will copy the answer in case it can be useful.

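AFAIK, jsoup is meant for HTML documents: by default it refuses other content types (unless you call ignoreContentType(true)), and reading a binary response as a string via body() corrupts the image data.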

If you open the Developer Tools in your browser, you can see the exact request the browser made. In Chrome, it is listed in the Network tab, where it can be copied as a cURL command.

The minimal cURL request would in your case be:

curl 'https://cdn.mangaeden.com/mangasimg/aa/aa75d306397d1d11d07d66746dae78a36dc78672ae9e97a08cb7abb4.jpg' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21' \
  --output image.jpg

You can refer to HedgeHog's answer for a sample Python solution; here's how to achieve the same in Java using the new HTTP Client:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse.BodyHandlers;
import java.nio.file.Paths;

public class ImageDownload {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Send a browser-like user-agent, otherwise the server answers with 403
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://cdn.mangaeden.com/mangasimg/aa/aa75d306397d1d11d07d66746dae78a36dc78672ae9e97a08cb7abb4.jpg"))
            .header("user-agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
            .build();
        // Stream the response body straight into image.jpg
        client.send(request, BodyHandlers.ofFile(Paths.get("image.jpg")));
    }
}

I adopted this solution in my Java code. One last thing: if the image is downloaded but you can't open it, the server probably answered with a 503 error; in that case, just perform the request again. You can recognize broken images because the image reader will say something like

Not a JPEG file: starts with 0x3c 0x68

The bytes 0x3c 0x68 are the characters <h, i.e. the file is an HTML error page instead of the image.
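
The check and retry described above could be automated roughly as follows. This is a hedged sketch, not the code I actually use: `looks_like_jpeg` and `fetch_jpeg` are hypothetical helper names, the retry count and delay are arbitrary, and the only fact relied on is that a JPEG file starts with the bytes 0xFF 0xD8.

```python
import time
from typing import Callable

JPEG_MAGIC = b"\xff\xd8"  # every JPEG file starts with the SOI marker FF D8


def looks_like_jpeg(data: bytes) -> bool:
    """Return True if the payload starts like a JPEG rather than e.g. '<h'."""
    return data.startswith(JPEG_MAGIC)


def fetch_jpeg(fetch: Callable[[], bytes], attempts: int = 3, delay: float = 1.0) -> bytes:
    """Call fetch() until it returns JPEG bytes, at most `attempts` times.

    `fetch` is any zero-argument callable returning the response body,
    so the retry logic stays independent of the HTTP library in use.
    """
    for attempt in range(attempts):
        data = fetch()
        if looks_like_jpeg(data):
            return data
        if attempt < attempts - 1:
            time.sleep(delay)  # brief pause before asking the server again
    raise ValueError(f"no valid JPEG after {attempts} attempts")
```

With the Python example from the other answer, usage might look like `fetch_jpeg(lambda: requests.get(url, headers=headers).content)`; the returned bytes can then be written to disk in binary mode.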

Stefano