Web scraping images in Python with Selenium and BeautifulSoup from an AJAX website

Question

I've spent a long time going through the HTML, JavaScript, network traffic, etc., and have learned a lot about JavaScript, blobs, and base64 encoding/decoding of images, but I still can't figure out how to extract the images in the videos on this website: https://www.jamesallen.com/loose-diamonds/all-diamonds/

Here's what I know: each video is actually a set of up to 512 images, which are retrieved from a server in files named setX.bin (where X is a number). They are then parsed via an int array into a blob object (there's also some base64 involved, but I forget where), which is somehow converted into an image.

Following the source code is very difficult as it is purposely written as spaghetti code.

How can I extract each diamond's images and do so efficiently?

My first idea is:

I can get the setX.bin files very easily, so if I can somehow pass them into the site's JavaScript functions, I should be good.

My second idea is:

Rotate each diamond manually and extract the images from the cache or something like that.

I'd like to use Python to do this.

EDIT: I found some JavaScript here on SO, but it gives 'SecurityError: The operation is not secure'. Here it is:

function exportCanvasAsPNG(id, fileName) {

    var canvasElement = document.getElementById(id);
    canvasElement.crossOrigin = "anonymous";
    var MIME_TYPE = "image/png";

    var imgURL = canvasElement.toDataURL(MIME_TYPE);
    window.console.log(canvasElement);
    var dlLink = document.createElement('a');
    dlLink.download = fileName;
    dlLink.href = imgURL;
    dlLink.dataset.downloadurl = [MIME_TYPE, dlLink.download, dlLink.href].join(':');

    document.body.appendChild(dlLink);
    dlLink.click();
    document.body.removeChild(dlLink);
}

exportCanvasAsPNG("canvas-key-_w5qzvdqpl",'asdf.png');

I ran it from the Firefox console. When I ran a similar script via execute_script in Python, I got the same error.

I want to be able to scrape all the 360-degree images for each canvas.

Edit2: To make this question simpler: I know how to get the setX.bin files, but I don't know how to convert this collection of images from bin to jpg. Each bin file contains multiple jpg files.


cody · Accepted Answer · 2019-02-05T21:30:04.263

The .bin files appear to contain just the JPEGs concatenated together, preceded by some leading metadata. You can simply scan the bytes of the file for JPEG file signatures (0xFFD8, the two bytes every JPEG starts with) and slice out each image:

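JPEG_MAGIC = b"\xff\xd8"  # the two bytes every JPEG starts with

with open("set0.bin", "rb") as f:
    s = f.read()

i = 0
# Skip the leading metadata by jumping to the first signature.
start_index = s.find(JPEG_MAGIC)

# find() returns -1 if the file contains no signature at all.
while start_index != -1:
    # The next signature (or the end of the file) ends the current image.
    end_index = s.find(JPEG_MAGIC, start_index + 1)

    if end_index == -1:
        end_index = len(s)

    with open(f"out{i:03}.jpg", "wb") as out:
        out.write(s[start_index:end_index])

    if end_index == len(s):
        break

    start_index = end_index
    i += 1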
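
If you need to process many sets, the same idea can be wrapped in a reusable function. The following is only a minimal sketch, not a tested tool for this site: `split_jpegs` and `extract_set` are hypothetical helper names, and it assumes (as observed above) that each .bin file is just metadata followed by concatenated JPEGs. It matches the three-byte prefix 0xFFD8FF rather than two bytes, on the assumption that this gives fewer accidental matches inside the leading metadata.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: every image starts with the two signature bytes followed by
# another 0xFF byte, which holds for ordinary JPEG files.
JPEG_SIGNATURE = b"\xff\xd8\xff"


def split_jpegs(data: bytes) -> list[bytes]:
    """Return the byte chunks that begin at each JPEG signature in data."""
    starts = []
    pos = data.find(JPEG_SIGNATURE)
    while pos != -1:
        starts.append(pos)
        pos = data.find(JPEG_SIGNATURE, pos + 1)
    # Each image runs up to the next signature; the last runs to the end.
    ends = starts[1:] + [len(data)]
    return [data[a:b] for a, b in zip(starts, ends)]


def extract_set(bin_path: str, out_dir: str) -> int:
    """Write every image found in bin_path into out_dir; return the count."""
    images = split_jpegs(Path(bin_path).read_bytes())
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    stem = Path(bin_path).stem
    for i, image in enumerate(images):
        (target / f"{stem}_{i:03}.jpg").write_bytes(image)
    return len(images)
```

For example, `extract_set("set0.bin", "frames")` would write `frames/set0_000.jpg`, `frames/set0_001.jpg`, and so on. Anything before the first signature is treated as metadata and dropped.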

edited Feb 05 '19 at 21:30

answered Feb 05 '19 at 21:24

Just checking against a few other sets, but I think you've got it. I'll award the bounty when I check them. Thank you very much! How did you know to search for this signature? – Monty Feb 05 '19 at 22:40
@Monty Files of a particular format generally have unique [signatures](https://en.wikipedia.org/wiki/List_of_file_signatures) in their leading bytes that allow them to be identified. This is how the [file](https://en.wikipedia.org/wiki/File_(command)) command works. – cody Feb 05 '19 at 22:56
Thank you very much for your solution! This works very well for my needs. I have also learned something new! – Monty Feb 06 '19 at 07:00
I hope you will check out ***[this post](https://stackoverflow.com/questions/54626470/cant-store-downloaded-files-in-their-concerning-folders)*** to offer any solution @cody. Thanks in advance. – robots.txt Feb 11 '19 at 08:52
Sorry I didn't catch this in time, but it looks like you have a solution! – Monty Mar 01 '19 at 23:10
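
To illustrate the signature idea from the comments: a format can often be guessed by comparing a file's leading bytes against a table of known signatures. This is only a rough sketch. `sniff_format` is a hypothetical helper, the table covers just a handful of well-known formats, and the real `file` command uses a far larger rule set.

```python
from __future__ import annotations

# A few well-known leading-byte signatures (illustrative, not exhaustive).
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip",
}


def sniff_format(data: bytes) -> str | None:
    """Guess a format name from the leading bytes, or None if unknown."""
    for signature, name in SIGNATURES.items():
        if data.startswith(signature):
            return name
    return None
```

Calling `sniff_format` on the first few bytes of one of the extracted files is a quick sanity check that the slice really starts with a JPEG.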
