
GITHUB LINK TO THE SCRIPT

https://github.com/Lexszin/learning-stuff/blob/master/Python/Web%20Crawling/Mangadex_downloader/main.py

PROBLEM DESCRIPTION

Basically, I made a script which downloads manga images from https://mangadex.org.

The script technically works fine, but it raises "Max retries exceeded" at the beginning of the second iteration through the loop... which doesn't make sense to me: the URL is updated every iteration and is only requested once, so how can there have been multiple retries when it was only called once?

The problem doesn't seem to be client-side, but it doesn't seem to be server-side either, since the images download fine on the first iteration. It's rather odd...

Here are the steps taken in the script:

  1. Crawl all the existing titles at https://mangadex.org/ and store them in "index.json"; if "index.json" already exists, load the file. (WORKING)

  2. Parse the ".xml" file exported from MyAnimeList and return all the manga titles from it. (WORKING)

  3. Loop through all the titles that are both in "index.json" and the parsed ".xml" file. (WORKING)

  4. For each manga, create a directory with the title, get the source code of the title's homepage through requests and find how many pages there are. (WORKING)

  5. Loop through each of the pages; for each page, get all the chapter titles and their links, keeping only chapters that are in English or Portuguese. (WORKING)

  6. After crawling the data from the title's homepage, loop through a zipped instance of the chapters' titles and their URLs. (WORKING)

  7. Create a directory inside the manga directory, named after the current iteration/chapter (1, 2, 3, etc.). Inside the newly created folder, create a folder named 'EN' (where only chapters in English will be stored). Inside the 'EN' folder, create a folder with the actual chapter's name; see the sketch after this list. (The reason for a folder with the chapter's name is that chapters are sometimes missing for a specific language, so if I relied on the iteration folder's number to know the current chapter, I could be at the right iteration but not at the correct chapter.) (WORKING)

  8. For each chapter link for the current title, go to its first page using Selenium's Chrome webdriver. (The contents are rendered in JavaScript.) (WORKING)

  9. When on the chapter's first page, get how many pages there are in the chapter, then download every image from the first page up to and including the last one into the newly created chapter folder.

  10. That's it. The loop then restarts at the next chapter. When all the chapters for the current title are done, a new loop starts with the next manga.
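
For illustration, here's a minimal sketch of the directory layout from step 7 (the names are placeholders, not the script's actual variables):

import os

# Placeholder values for illustration; the script derives these from the crawled data.
manga_dir = "Some Manga Title"
iteration = 1
chapter_name = "Ch. 1 - Beginnings"

# <manga>/<iteration>/EN/<chapter name>, created in one call;
# exist_ok avoids an error when re-running the script.
chapter_dir = os.path.join(manga_dir, str(iteration), "EN", chapter_name)
os.makedirs(chapter_dir, exist_ok=True)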

It works fine, as intended...

However, after the first complete loop cycle (after downloading all the pages of the current chapter and moving on to the next chapter), I get an exception. This happens every time the script is run, with different IP addresses and different titles. It also fully downloads the first chapter of the specified title every time.

From what I can tell, after the first cycle, the error is raised at the line where Selenium loads the first page of the next chapter.

I have a NordVPN subscription, so I re-routed my IP multiple times, and still got the same error.

Also, if the images are already present in the folder they're supposed to be in, the script just skips the current chapter and starts downloading the next one, so even without downloading ANYTHING, I still get this error message.

Any thoughts on what might be causing this issue?

ERROR

DevTools listening on ws://127.0.0.1:51146/devtools/browser/b6d08910-ea23-4279-b9d4-6492e6b865d0
Traceback (most recent call last):
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\util\connection.py", line 80, in create_connection
    raise err
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\util\connection.py", line 70, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1224, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1016, in _send_output
    self.send(msg)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 956, in send
    self.connect()
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connection.py", line 181, in connect
    conn = self._new_conn()
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connection.py", line 168, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x000002128FCDD518>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:/Programming/Python/Projects/Mangadex.downloader/main.py", line 154, in <module>
    driver.get(chapter_start_url)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 333, in get
    self.execute(Command.GET, {'url': url})
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 319, in execute
    response = self.command_executor.execute(driver_command, params)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\remote_connection.py", line 374, in execute
    return self._request(command_info[0], url, body=data)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\remote_connection.py", line 397, in _request
    resp = self._conn.request(method, url, body=body, headers=headers)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\request.py", line 72, in request
    **urlopen_kw)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\request.py", line 150, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\poolmanager.py", line 323, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 667, in urlopen
    **response_kw)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 667, in urlopen
    **response_kw)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 667, in urlopen
    **response_kw)
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\util\retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=51139): Max retries exceeded with url: /session/4f72fba8650ac3ead558cb25172b4b38/url (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000002128FCDD518>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))

OBJECTIVE

I'm making a script which parses manga titles from your exported MyAnimeList XML list (it probably works for Anilist as well) and downloads all the listed titles that exist at https://mangadex.org.

Modules I'm using: requests, re, Beautiful Soup, json, os, selenium, time and urllib

Requests - Used to get the source code of pages that had the information I needed

Re - Used regex to parse the ".xml" file that contains your manga list, exported from https://myanimelist.net, and to change the link of the current image to be downloaded when inside a chapter. (The links always end in ".jpg" or ".png", have a number before the extension, which is the number of the current page, and a random letter before that number.)
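
As an illustration of that second use, a sketch of the page-number rewrite (the example URL is made up; only the "<letter><number>.<extension>" ending described above matters):

import re

def page_image_url(url, page):
    # Swap the page number that sits just before the extension,
    # leaving the letter in front of it untouched.
    return re.sub(r"\d+(?=\.(?:jpg|png)$)", str(page), url)

img_url = "https://example.org/data/abc123/x1.png"  # made-up URL
print(page_image_url(img_url, 2))  # .../x2.png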

Beautiful Soup - Used to parse the responses from requests: the titles, links to titles, chapter titles, links to chapters, etc.
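
For reference, a minimal sketch of that parsing step, assuming a made-up URL and CSS class (the site's real markup differs):

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://mangadex.org/title/12345")  # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")

# "a.manga_title" is an assumed selector, used here only for illustration.
for link in soup.select("a.manga_title"):
    title, href = link.text.strip(), link["href"]
    print(title, href)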

JSON - Used to store the parsed manga list in "index.json" and load it back on later runs.
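
A minimal sketch of that cache-or-crawl logic, with hypothetical helper names:

import json
import os

INDEX_PATH = "index.json"

def load_index():
    # Return the cached title index if it exists, else None so the
    # caller knows it has to crawl the site first.
    if os.path.isfile(INDEX_PATH):
        with open(INDEX_PATH, encoding="utf-8") as f:
            return json.load(f)
    return None

def save_index(index):
    with open(INDEX_PATH, "w", encoding="utf-8") as f:
        json.dump(index, f, ensure_ascii=False, indent=2)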

OS - Used to check whether a file or directory exists.

Selenium - Used only inside the chapters, since the reader uses JavaScript to load the image (which is what gets downloaded) and the page count of the current chapter (used as the basis for looping through the images, since they share the same name and the only thing that changes in the URL is the current page number).

Time - Used only once, after Selenium loads the chapter page, to give the page time to fully load.
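
For context, a minimal sketch of how Selenium and time fit together here (the URL and the CSS selector are placeholders, not the site's actual markup):

import time
from selenium import webdriver

chapter_start_url = "https://mangadex.org/chapter/00000/1"  # placeholder

driver = webdriver.Chrome()
driver.get(chapter_start_url)
time.sleep(3)  # give the JavaScript reader time to render, as in the script

# Hypothetical selector; the element that shows the page count on the
# real site will differ.
total_pages = int(driver.find_element_by_css_selector(".total-pages").text)
driver.quit()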

Urllib - Used to download the chapter images.
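
A sketch of that download step, including the skip-if-present behavior mentioned earlier (the function name is illustrative):

import os
import urllib.request

def download_page(img_url, chapter_dir, page):
    # Save as "<page><ext>" inside the chapter folder; skipping files
    # that already exist means re-runs don't download anything twice.
    ext = os.path.splitext(img_url)[1]  # ".jpg" or ".png"
    dest = os.path.join(chapter_dir, str(page) + ext)
    if not os.path.exists(dest):
        urllib.request.urlretrieve(img_url, dest)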

PS - MyAnimeList and Anilist are indexes for anime and manga series, where you keep lists of both and can tag each item. (Whether you plan on reading the manga or watching the anime, whether it's completed, etc.)

  • Can you post the code you have written so far? – C.Nivs Mar 14 '19 at 21:08
  • Also, if anyone has suggestions for improving the script, I'm also glad to hear! =) – Slins Mar 14 '19 at 21:08
  • There's a github link to it at the top of the post, do you want me to paste it in a code block? – Slins Mar 14 '19 at 21:09
  • I hadn't read it, but after reading it, it's pretty much what I expected and matches how my post is done. Thanks for the welcome!! I did figure it would not be good to post the whole code in the topic; that's why I linked it to GitHub. – Slins Mar 14 '19 at 22:33

1 Answer


I'm not sure if this is 100% relevant, but I encountered a similar error recently. The cause I found was that cookies could not be stored, so the site was essentially bouncing my request between two of their servers: one would try to assign my browser a cookie, the other would expect that cookie, but since my request wasn't sent with it, I was referred back to the first server. The code I found to solve it was:

import requests

s = requests.session()  # a session persists cookies across requests
s.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'

I think you're supposed to copy/paste the above lines...I did :) Then get the URL with:

import bs4

res = s.get(my_URL)  # my_URL is the page you're scraping
soup = bs4.BeautifulSoup(res.text, 'html.parser')

Using requests.session() like this allows the cookies to be saved and then sent to the other internal server, where they're processed correctly.

Reedinationer
  • Yeah, I did read about requests' sessions yesterday when tweaking the script, and I will try that... though I think the problem will remain, as the last frame in the traceback is on a line where Selenium's webdriver is called rather than requests. – Slins Mar 14 '19 at 21:22
  • I re-read my script and I have only used requests once (before entering the loop where the exception occurs), and considering its application, there aren't any problems with the requests... It could be a problem with my urllib request, but I still think a problem with Selenium is much more likely. – Slins Mar 14 '19 at 21:25