GITHUB LINK TO THE SCRIPT
PROBLEM DESCRIPTION
Basically, I made a script which downloads manga images from https://mangadex.org.
The script technically works fine, but raises "Max retries exceeded" at the beginning of the second iteration through the loop... which doesn't make sense to me: the URL is updated every iteration and is only requested once, so how can there have been multiple retries when it was only called once?
The problem doesn't seem to be client-side, but it doesn't seem to be server-side either, since the images download fine on the first iteration. It's rather odd...
Here are the steps taken in the script:
Crawl all the existing titles at https://mangadex.org/, store at "index.json", if "index.json" already exists, load the file. (WORKING)
Parse the ".xml" file imported from Myanimelist and return all the manga titles from it. (WORKING)
Loop through all the titles that are in both "index.json" and the parsed ".xml" file. (WORKING)
For each manga, create a directory with the title, get the source code of the title's homepage through requests and find how many pages there are. (WORKING)
Loop through each of the pages; for each page, get all the manga titles and their links, keeping only manga in either English or Portuguese. (WORKING)
After crawling the data from the title's homepage, loop through a zipped instance of the chapter's title and its url. (WORKING)
Create a directory inside the manga directory, named after the current iteration/chapter (1, 2, 3, etc.). Inside the newly created folder, create a folder named 'EN' (where only chapters in English will be stored). Inside the newly created 'EN' folder, create a folder with the actual chapter's name. (The reason for the chapter-name folder is that sometimes chapters are missing for a specific language; if I used the iteration folder's number to identify the current chapter, I would be in the right iteration, but possibly not on the correct chapter.) (WORKING)
For each chapter link for the current title, go to its first page using Selenium's chrome webdriver. (The contents are rendered in JavaScript) (WORKING)
Once on the chapter's first page, get how many pages the chapter has. Download every image, from the first page up to and including the last, into the newly created chapter folder.
That's it. The loop then restarts at the next chapter. When all the chapters for the current title are done, a new loop would start with a new manga.
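The folder layout described in the steps above can be sketched roughly as follows (a minimal sketch, not the script's actual code; the function and argument names are hypothetical):

```python
import os

def chapter_dir(base, manga_title, iteration, chapter_name, lang="EN"):
    """Build <base>/<title>/<iteration>/<lang>/<chapter name> and create it.

    The chapter-name folder matters because a language can be missing a
    chapter, so the iteration number alone can't identify which chapter
    the folder holds.
    """
    # Strip characters Windows rejects in file names (simplistic sanitizer).
    safe_name = "".join(c for c in chapter_name if c not in '\\/:*?"<>|')
    path = os.path.join(base, manga_title, str(iteration), lang, safe_name)
    os.makedirs(path, exist_ok=True)  # no error if a previous run created it
    return path
```

With `exist_ok=True`, re-running the script over an existing tree doesn't raise, which fits the "skip chapters already on disk" behavior described below the traceback.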
It works fine, as intended...
However, after the first complete loop cycle (after downloading all the pages for the current chapter, then looping over to the next chapter), I get an exception. This happens every time the script is run, with different IP addresses and different titles, and the first chapter of the specified title is always fully downloaded.
From what I can tell, after the first cycle, the error is raised at the line where Selenium loads the chapter's first page.
I have a NordVPN subscription, so I re-routed my IP multiple times, and still got the same error.
Also, if the images are already present in the folder they're supposed to be in, the script just skips the current chapter and starts on the next one, so even without downloading ANYTHING, I still get this error message.
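The skip check mentioned here amounts to something like the sketch below (hypothetical names; the real script's logic may differ in details such as the expected file extensions):

```python
import os

def chapter_complete(folder, expected_pages):
    """True when the chapter folder already holds every page image."""
    if not os.path.isdir(folder):
        return False
    images = [f for f in os.listdir(folder) if f.endswith((".jpg", ".png"))]
    return len(images) >= expected_pages
```

The point of showing it: the check itself does no network I/O, so the exception firing even when everything is skipped suggests the failure is in the Selenium call, not in the downloads.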
Any thoughts on what might be causing this issue?
ERROR
DevTools listening on ws://127.0.0.1:51146/devtools/browser/b6d08910-ea23-4279-b9d4-6492e6b865d0
Traceback (most recent call last):
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\util\connection.py", line 80, in create_connection
raise err
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\util\connection.py", line 70, in create_connection
sock.connect(sa)
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
chunked=chunked)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 354, in _make_request
conn.request(method, url, **httplib_request_kw)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1229, in request
self._send_request(method, url, body, headers, encode_chunked)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1275, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1224, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1016, in _send_output
self.send(msg)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 956, in send
self.connect()
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connection.py", line 181, in connect
conn = self._new_conn()
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connection.py", line 168, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x000002128FCDD518>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:/Programming/Python/Projects/Mangadex.downloader/main.py", line 154, in <module>
driver.get(chapter_start_url)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 333, in get
self.execute(Command.GET, {'url': url})
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 319, in execute
response = self.command_executor.execute(driver_command, params)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\remote_connection.py", line 374, in execute
return self._request(command_info[0], url, body=data)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\remote_connection.py", line 397, in _request
resp = self._conn.request(method, url, body=body, headers=headers)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\request.py", line 72, in request
**urlopen_kw)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\request.py", line 150, in request_encode_body
return self.urlopen(method, url, **extra_kw)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\poolmanager.py", line 323, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 667, in urlopen
**response_kw)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 667, in urlopen
**response_kw)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 667, in urlopen
**response_kw)
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\Users\alexT\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\util\retry.py", line 398, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=51139): Max retries exceeded with url: /session/4f72fba8650ac3ead558cb25172b4b38/url (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000002128FCDD518>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
OBJECTIVE
I'm making a script which parses manga titles from your exported MyAnimeList XML list (it probably works for AniList as well) and downloads every listed title that exists at https://mangadex.org
Modules I'm using: requests, re, Beautiful Soup, json, os, selenium, time and urllib
Requests - Used to get the source code of pages that had the information I needed
Re - Regex is used to parse the ".xml" file that contains your manga list, exported from https://myanimelist.net, and to change the link of the current image to be downloaded when inside a chapter. (The links always end in ".jpg" or ".png", with the current page number right before the extension and a random letter before the number.)
Beautiful Soup - Used to parse the responses from requests: the titles, links to titles, chapter titles, links to chapters, etc.
JSON - Used to store and load the data from the parsed manga list to/from "index.json"
OS - Used to check if file/directory exists.
Selenium - Used only inside the chapters, as the reader uses JavaScript to load the image (which is what gets downloaded) and the page count of the current chapter (used as the basis for looping through the images, since they share the same name and only the page number changes in the URL).
Time - Used only once, after Selenium loads the chapter page, to wait until the page is fully loaded.
Urllib - Used to download the chapter images.
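Given the URL shape described under "Re" (page number right before the ".jpg"/".png" extension), the per-page link rewrite can be sketched like this (a sketch under that assumed URL shape; `page_url` is a hypothetical name, not a function from the script):

```python
import re

def page_url(first_page_url, page_number):
    # The image links end in ".jpg"/".png" with the page number immediately
    # before the extension (a letter precedes the number, per the description).
    # Swap only that trailing number; the rest of the URL is left untouched.
    return re.sub(r"\d+(\.(?:jpg|png))$",
                  lambda m: f"{page_number}{m.group(1)}",
                  first_page_url)
```

Each rewritten URL would then be handed to urllib to fetch the image.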
PS - MyAnimeList and AniList are indexes for anime and manga series, where you keep lists of both and can set tags for each item on the list (whether you plan to read the manga, are watching the anime, have completed it, etc.).