
I want to automate downloading pictures from a web source that serves them as streams encoded as Base64 strings. Google Chrome correctly recognizes the data from the source as a JPG picture and displays it.

Now, this page is accessible only to registered users. Should I use Selenium in that case?

So, basically, I want to generate around 1000 URL requests and save all the streamed pictures to my local disk.

An example of a requested URL:

https://ia800703.us.archive.org/BookReader/BookReaderImages.php?zip=/10/items/nortonreaderan6theast/nortonreaderan6theast_jp2.zip&file=nortonreaderan6theast_jp2/nortonreaderan6theast_1257.jp2&scale=1&rotate=0
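The roughly 1000 requests could be generated from the pattern of the example URL above. This is only a sketch: it assumes the four-digit number before `.jp2` is the page index and that every other parameter stays fixed. The names `BASE`, `ITEM`, and `page_url`, and the range 1 to 1000, are hypothetical.

```python
from urllib.parse import urlencode

# Taken from the example URL; assumed to be the same for every page.
BASE = "https://ia800703.us.archive.org/BookReader/BookReaderImages.php"
ITEM = "nortonreaderan6theast"


def page_url(page: int) -> str:
    """Build the image URL for one page, following the example URL's pattern."""
    params = {
        "zip": f"/10/items/{ITEM}/{ITEM}_jp2.zip",
        # Assumption: the page index is zero-padded to four digits.
        "file": f"{ITEM}_jp2/{ITEM}_{page:04d}.jp2",
        "scale": 1,
        "rotate": 0,
    }
    # safe="/" leaves slashes unescaped so the result matches the example URL.
    return f"{BASE}?{urlencode(params, safe='/')}"


# Hypothetical page range; adjust it to the real page count of the book.
urls = [page_url(n) for n in range(1, 1001)]
```

Calling `page_url(1257)` reproduces the example URL exactly, which is a quick check that the pattern was read correctly.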

The response is an HTML document with a picture:

<html>
<head>
<meta name="viewport" content="width=device-width, minimum-scale=0.1">
<title>BookReaderImages.php (2447×4005) </title>
</head>
<body style="margin: 0px; background: #0e0e0e;">
<img style="-webkit-user-select: none;cursor: zoom-in;" src="https://ia800703.us.archive.org/BookReader/BookReaderImages.php?zip=/10/items/nortonreaderan6theast/nortonreaderan6theast_jp2.zip&file=nortonreaderan6theast_jp2/nortonreaderan6theast_1257.jp2&scale=1&rotate=0" width="556" height="911">
</body>
</html>

The picture stream is a Base64 string. The browser lets me save it as nortonreaderan6theast_1257.jpg.
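If the picture data really does arrive as a Base64 string, turning it into a file is a small step. The sketch below assumes the string is already in hand (how to obtain it from the site is not shown) and that it may carry a `data:` URI prefix. The function names are hypothetical.

```python
import base64


def decode_base64_image(data: str) -> bytes:
    """Return the raw image bytes encoded in a Base64 string."""
    # Drop an optional data-URI prefix such as "data:image/jpeg;base64,".
    if data.startswith("data:"):
        _, _, data = data.partition(",")
    return base64.b64decode(data)


def save_base64_image(data: str, out_path: str) -> None:
    """Decode the string and write the bytes to out_path in binary mode."""
    with open(out_path, "wb") as f:
        f.write(decode_base64_image(data))
```

For example, `save_base64_image(encoded, "nortonreaderan6theast_1257.jpg")` would write the decoded bytes to disk, where `encoded` stands for the string captured from the response.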

Any suggestions?

  • possible duplicate of https://stackoverflow.com/questions/17361742/download-image-with-selenium-python – theGuy Aug 14 '18 at 18:48
  • No. It is not a duplicate. You can't use snapshots in my case. The image size is 2447×4005 and it is displayed resized to fit the screen. And as you can see, the image source doesn't point to the picture directly. I suspect that the easiest way to handle this stream is by using `Chrome dev-tools API`. But I am not sure. – CitizenVito Aug 14 '18 at 20:23

1 Answer


I managed to implement a working solution, although it is far from ideal. I used Selenium, chromedriver, and the Chrome extension Click and Save. Once a browser instance has started, I have to install the extension manually. After that, I log into the website and open the book I want to download. I have to repeat these steps every time a new instance is created.

Inside the loop that runs through all the page URLs, I use:

    driver.get(url)  # Selenium: load the page that serves the image
    # The Click and Save extension automatically detects the picture and
    # saves it to the Downloads directory (or another one) on Windows.
    while not os.path.exists(file_path):  # wait until the file has been created
        time.sleep(0.5)  # needs `import os` and `import time` at the top of the script
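
The polling loop above waits forever if a download never completes. A minimal sketch of the same wait with a timeout follows. The helper name `wait_for_file` and the default values are assumptions, not part of the original solution.

```python
import os
import time


def wait_for_file(file_path: str, timeout: float = 30.0, poll: float = 0.5) -> bool:
    """Poll until file_path exists; return False if timeout seconds pass first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(file_path):
            return True
        time.sleep(poll)
    # One last check in case the file appeared during the final sleep.
    return os.path.exists(file_path)
```

Inside the loop, `if not wait_for_file(file_path): ...` would let the script log or retry the page instead of hanging.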

Overall, this process is very slow: around 1000 pages per hour. Any improvements are welcome.