
Note: Any solution is welcome; Selenium seems like the most likely tool for this.

Imgur has albums, the image links of the albums are stored in (a React element?) GalleryPost.album_image_store._.posts.{ALBUM_ID}.images (thanks to this guy for figuring this out).

Using the React DevTools extension for Chrome, I can see this array of image links, but I want to be able to access it from a Python script.

Any ideas how?

P.S. I don't know much at all about React, so please excuse me if this is a stupid question or if I'm using incorrect terminology.

Here's the album I've been working with: https://imgur.com/a/JNzjB

Implemented Solution:

Huge thanks to Eduard Florinescu for working with me to figure all this out. I knew hardly anything about Selenium, how to run JavaScript in Selenium, or which commands I could use.

Modifying some of his code, I ended up with the following.

from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver  
from selenium.webdriver.chrome.options import Options


# Snagged from: https://stackoverflow.com/a/480227
def rmdupe(seq):
    # Removes duplicates from list
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]


chrome_options = Options()  
chrome_options.add_argument("--headless")  

prefs = {"profile.managed_default_content_settings.images":2}
chrome_options.add_experimental_option("prefs",prefs)

driver = webdriver.Chrome(options=chrome_options)  # newer Selenium releases take options= instead of chrome_options=
driver.set_window_size(1920, 10000)
driver.get("https://imgur.com/a/JNzjB")


links = []
for i in range(0, 10):  # Tune as needed
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for div in soup.find_all('div', {'class': 'image post-image'}):
        imgs = div.find_all('img')
        for img in imgs:
            srcs = img.get_attribute_list('src')
            links.extend(srcs)
        sources = div.find_all('source')
        for s in sources:
            srcs = s.get_attribute_list('src')
            links.extend(srcs)
    links = rmdupe(links)  # Remove duplicates
    driver.execute_script('window.scrollBy(0, 750)')
    sleep(.2)

>>> len(links)
# 36 -- Huzzah! Got all the album links!

Notes:

  • Creates a headless Chrome instance, so the code can be used in a script or potentially a larger project.

  • I used BeautifulSoup because it's a bit easier to work with, and I was having some issues finding elements and accessing their values using Selenium (likely due to inexperience).

  • Set the window size to be really "tall" so more image links are loaded at once.

  • Disabled images in the Chrome browser settings to stop the browser from actually downloading the images (all I need are the links).

  • Some links are .mp4 files and are rendered in html as video elements with <source> tags contained inside which contain the link. The portion of code starting with sources = div.find_all('source') is there to make sure no album links are lost.
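The extraction step from the notes above (collect `src` from both `<img>` and `<source>` tags inside the post-image divs, then drop duplicates) can be tried against a static HTML string without a browser. This is only a sketch: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without third-party packages, it matches any div whose class list contains `post-image` rather than the exact class string, and `MediaLinkParser`, `extract_media_links`, and the sample URLs are made-up names for illustration.

```python
from html.parser import HTMLParser


class MediaLinkParser(HTMLParser):
    """Collect src attributes of <img>/<source> tags inside post-image divs."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._depth = 0  # > 0 while inside a <div class="... post-image ...">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Track nesting so inner divs do not end the post-image scope early
            if self._depth or "post-image" in classes:
                self._depth += 1
        elif self._depth and tag in ("img", "source") and attrs.get("src"):
            self.links.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1


def extract_media_links(html):
    parser = MediaLinkParser()
    parser.feed(html)
    # dict.fromkeys keeps first-seen order while dropping duplicates
    return list(dict.fromkeys(parser.links))


# Hypothetical markup imitating the structure described in the notes
sample = """
<div class="image post-image"><img src="//i.imgur.com/a.jpg"></div>
<div class="image post-image">
  <video><source src="//i.imgur.com/b.mp4" type="video/mp4"></video>
</div>
<div class="image post-image"><img src="//i.imgur.com/a.jpg"></div>
<div class="sidebar"><img src="//i.imgur.com/ad.png"></div>
"""
print(extract_media_links(sample))
# ['//i.imgur.com/a.jpg', '//i.imgur.com/b.mp4']
```

With Selenium, the same function could be fed `driver.page_source` after each scroll.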

Eduard Florinescu
Bobs Burgers
  • Can you add link to that page? – Mario Nikolaus Feb 15 '18 at 19:27
  • Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, [describe the problem](https://meta.stackoverflow.com/questions/254393/what-exactly-is-a-recommendation-question) and what has been done so far to solve it. – undetected Selenium Feb 15 '18 at 19:28
  • @MarioNikolaus Any imgur album will work. Here's an example: https://imgur.com/a/JNzjB. – Bobs Burgers Feb 15 '18 at 19:37
  • At first glance, you could retrieve the links using an XPath (`//div[@class="post-images"]//img`) and doing `get_attribute('src')`, but the thing is the DOM changes as you scroll down... at least it's a start. :P – Mangohero1 Feb 15 '18 at 22:37
  • @Mangohero1 Exactly the problem I'm running into. Being able to access the react components would solve the problem, but I can't find any way to do this. – Bobs Burgers Feb 15 '18 at 22:41

1 Answer


You don't need to know any framework to automate a page. You just need to access the DOM, and you can do that with Selenium and Python. Sometimes a little vanilla JavaScript helps, though.

To get those links, you can try pasting this into the browser console:

images_links =[]; images = document.querySelectorAll("img"); for (image of images){images_links.push(image.src)} console.log(images_links)

The same JS snippet driven from Selenium with Python:

import selenium
from selenium import webdriver
from time import sleep
driver = webdriver.Chrome()

driver.get("https://imgur.com/a/JNzjB")
for i in range(0,7): # here you will need to tune to see exactly how many scrolls you need
  driver.execute_script('window.scrollBy(0, 2000)')

sleep(2)
list_of_images_links=driver.execute_script('images_links =[]; images = document.querySelectorAll("img"); for (image of images){images_links.push(image.src)} return images_links;')
list_of_images_links

[Screenshot: the list of image links returned by the script]
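Rather than hard-coding the number of scrolls, one option is to keep scrolling until several scrolls in a row reveal no new links. The sketch below assumes `driver` is any object exposing Selenium's `page_source` and `execute_script`. The names `collect_until_stable`, `extract`, `patience`, and `FakeDriver` are hypothetical, and the fake driver only imitates a page that keeps four images in the DOM at a time so the loop can be run without a browser.

```python
from time import sleep


def collect_until_stable(driver, extract, step=750, pause=0.2,
                         patience=3, max_scrolls=200):
    """Scroll in steps, gathering links until `patience` consecutive
    scrolls add nothing new (or `max_scrolls` is reached)."""
    links = {}  # dict used as an insertion-ordered set
    idle = 0
    for _ in range(max_scrolls):
        before = len(links)
        for link in extract(driver.page_source):
            links.setdefault(link, None)
        idle = idle + 1 if len(links) == before else 0
        if idle >= patience:
            break
        driver.execute_script("window.scrollBy(0, arguments[0])", step)
        sleep(pause)
    return list(links)


class FakeDriver:
    """Stand-in for a Selenium driver: exposes a sliding window of 4 items."""

    def __init__(self, total):
        self.total, self.pos = total, 0

    @property
    def page_source(self):
        end = min(self.pos + 4, self.total)
        return " ".join(f"img{i}" for i in range(self.pos, end))

    def execute_script(self, script, *args):
        self.pos += 2  # each "scroll" moves the window down by two items


print(len(collect_until_stable(FakeDriver(36), str.split, pause=0)))
# 36
```

With a real driver, `extract` would be a function that pulls the links out of the page HTML, and `pause` would need to be long enough for the page to load new images after each scroll.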

Update:

You don't need Selenium: just paste this into an Opera console (make sure you allow multiple downloads) and voilà:

document.body.style.zoom=0.1; images=document.querySelectorAll("img"); for (i of images) { var a = document.createElement('a'); a.href = i.src; console.log(i); a.download = i.src; document.body.appendChild(a); a.click(); document.body.removeChild(a); }

The same thing, formatted for readability:

document.body.style.zoom=0.1;
images = document.querySelectorAll("img");
for (i of images) {
    var a = document.createElement('a');
    a.href = i.src;
    console.log(i);
    a.download = i.src;
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);
}
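If the goal is to stay in Python, the download loop above could be mirrored once the links have been collected. This is a rough sketch using only the standard library, not something tested against imgur itself: `filename_from_url` and `download_all` are hypothetical helper names, the `album` directory is an arbitrary choice, and the site may require extra request headers that are not handled here.

```python
import os
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    return os.path.basename(urlsplit(url).path)


def download_all(links, dest="album"):
    """Save every link into the `dest` directory (created if missing)."""
    os.makedirs(dest, exist_ok=True)
    for url in links:
        if url.startswith("//"):  # protocol-relative link, as often found in src
            url = "https:" + url
        urlretrieve(url, os.path.join(dest, filename_from_url(url)))
```

For example, `download_all(list_of_images_links)` would save each collected link under `album/`.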

Update 2: Opera webdriver

import os
from time import sleep
from selenium import webdriver
from selenium.webdriver.common import desired_capabilities
from selenium.webdriver.opera import options

_operaDriverLoc = os.path.abspath('c:\\Python27\\Scripts\\operadriver.exe')  # Replace this path with the actual path on your machine.
_operaExeLoc = os.path.abspath('c:\\Program Files\\Opera\\51.0.2830.34\\opera.exe')   # Replace this path with the actual path on your machine.

_remoteExecutor = 'http://127.0.0.1:9515'
_operaCaps = desired_capabilities.DesiredCapabilities.OPERA.copy()

_operaOpts = options.ChromeOptions()
_operaOpts._binary_location = _operaExeLoc

# Use the below argument if you want the Opera browser to be in the maximized state when launching.
# The full list of supported arguments can be found on http://peter.sh/experiments/chromium-command-line-switches/
_operaOpts.add_argument('--start-maximized')

driver = webdriver.Chrome(executable_path = _operaDriverLoc, chrome_options = _operaOpts, desired_capabilities = _operaCaps)


driver.get("https://imgur.com/a/JNzjB")
for i in range(0,7): # here you will need to tune to see exactly how many scrolls you need
  driver.execute_script('window.scrollBy(0, 2000)')

sleep(4)
driver.execute_script("document.body.style.zoom=0.1")
list_of_images_links=driver.execute_script('images_links =[]; images = document.querySelectorAll("img"); for (image of images){images_links.push(image.src)} return images_links;')
list_of_images_links
driver.execute_script('document.body.style.zoom=0.1; images=document.querySelectorAll("img"); for (i of images) { var a = document.createElement("a"); a.href = i.src; console.log(i); a.download = i.src; document.body.appendChild(a); a.click(); document.body.removeChild(a); }')
Community
Eduard Florinescu
  • Tried the selenium portion myself, but it doesn't return all the links in the album. As your screenshot shows, it only returns 4 links, though there are 36 in that album. At least it's consistent, I got the exact same links back as you did. – Bobs Burgers Feb 15 '18 at 21:31
  • The reason it's not getting all the links is that imgur dynamically loads the images based on the scroll position. If you scroll all the way down, you'll only see the last 4 images, which is why only 4 were returned. Is there a way to get all the images that have been loaded instead of the images currently in the HTML source? This is why I was hoping for a way to query the React props. – Bobs Burgers Feb 15 '18 at 21:37
  • I see that; it's a tricky thing. I'll try to work out why, because if I paste `images = document.querySelectorAll("img");` in the console after scrolling, it gets all the elements. I think it has some sort of protection. – Eduard Florinescu Feb 15 '18 at 21:38
  • It only gets the items that are in view. If I zoom out, it gets more. – Eduard Florinescu Feb 15 '18 at 21:40
  • What % do you set the zoom to? Could you possibly put it all together in your answer? – Bobs Burgers Feb 15 '18 at 21:45
  • I was on an Opera browser; I see that Chrome still protects against it. I use `document.body.style.zoom=0.1`, which in Selenium would be `driver.execute_script("document.body.style.zoom=0.1")`. – Eduard Florinescu Feb 15 '18 at 21:47
  • I will try with the Opera driver, since it's a Chrome thing. Bear with me, as I haven't used operadriver before. – Eduard Florinescu Feb 15 '18 at 21:51
  • No problem, take your time. Do you by chance know of any way to access a react prop from javascript? If I could get access to `GalleryPost.album_image_store._.posts.JNzjB.images` I could get all the image links – Bobs Burgers Feb 15 '18 at 21:55
  • @BobsBurgers I don't know, but doesn't the React code get obfuscated and minified when deployed? Hmm, I'd better look into the Opera thing rather than go down the reverse-engineering route. – Eduard Florinescu Feb 15 '18 at 21:59
  • I think it's a Chrome thing that it only sources 4 images; I see that on both the Opera browser and Chromium the JS bit loads all the images. – Eduard Florinescu Feb 15 '18 at 22:04
  • Not sure. [This guy's article](https://spapas.github.io/2016/06/27/download-imgur-album-images/), which I linked in the question, shows you how to access the React components. However, neither he nor I have been able to figure out how to access the React components in pure JavaScript (without the need for the React DevTools extension). – Bobs Burgers Feb 15 '18 at 22:07
  • Great! Could you update your answer with your solution? Also, what opera webdriver version are you using (is it opera webdriver using blink?) – Bobs Burgers Feb 15 '18 at 22:09
  • @BobsBurgers Then you don't need selenium just paste this in an Opera console (see that you enable multiple Downloads) and voila: `document.body.style.zoom=0.1; images=document.querySelectorAll("img"); for (i of images) { var a = document.createElement('a'); a.href = i.src; console.log(i); a.download = i.src; document.body.appendChild(a); a.click(); document.body.removeChild(a); }` Works for me I tested it – Eduard Florinescu Feb 15 '18 at 22:19
  • Sorry I didn't mean to confuse. I want to do this in a python script. I was simply giving his article as reference so you could see where the image links were stored. If I used selenium with an Opera driver, could I run that code in an execute_script to get all the images (figured I'd ask since I don't have an opera webdriver installed atm). – Bobs Burgers Feb 15 '18 at 22:28
  • @BobsBurgers It should work. I'm still looking into the Opera thing, since the Opera webdriver is not that straightforward, and will update the answer. – Eduard Florinescu Feb 15 '18 at 22:41
  • Thanks. I appreciate it. Currently banging my head against the wall trying to get opera webdriver working on debian 8.... – Bobs Burgers Feb 15 '18 at 22:50
  • Same on Windows. It seems that Opera support sucks and the only guy working on it quit: https://github.com/operasoftware/operachromiumdriver/issues/27 Did you try the code in the console? Does it work for you? – Eduard Florinescu Feb 15 '18 at 22:54
  • I will look once more into https://stackoverflow.com/questions/31055124/drive-opera-with-selenium-python to see how to make Opera work, and if that doesn't work, give up. – Eduard Florinescu Feb 15 '18 at 23:01
  • @BobsBurgers I managed to make the Opera driver work, but it's messy. The Opera driver sees more than 4 images, but still not all of them. It seems that imgur detects the webdriver. I don't know what to do at this point; I can go no further. – Eduard Florinescu Feb 15 '18 at 23:21
  • @BobsBurgers Updated with the Opera code, but there are still problems. I think this is a hard nut to crack; it comes down to https://stackoverflow.com/a/41220267/1577343 – Eduard Florinescu Feb 15 '18 at 23:24