Note: Any solution is fine; Selenium seems like the most likely tool for this.
Imgur has albums, and the image links of an album are stored in (a React element?) GalleryPost.album_image_store._.posts.{ALBUM_ID}.images
(thanks to this guy for figuring that out).
Using the React DevTools extension for Chrome I can see this array of image links, but I want to be able to access it from a Python script.
Any ideas how?
P.S. I don't know much at all about React, so please excuse me if this is a stupid question or if I'm using incorrect terminology.
Here's the album I've been working with: https://i.stack.imgur.com/545pu.jpg
Implemented Solution:
Huge thanks to Eduard Florinescu for working with me to figure all this out. I hardly knew anything about Selenium, how to run JavaScript in Selenium, or which commands I could use.
Modifying some of his code, I ended up with the following.
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


# Snagged from: https://stackoverflow.com/a/480227
def rmdupe(seq):
    # Removes duplicates from a list while preserving order
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]


chrome_options = Options()
chrome_options.add_argument("--headless")
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)

driver = webdriver.Chrome(options=chrome_options)
driver.set_window_size(1920, 10000)
driver.get("https://i.stack.imgur.com/545pu.jpg")

links = []
for i in range(0, 10):  # Tune as needed
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for div in soup.find_all('div', {'class': 'image post-image'}):
        imgs = div.find_all('img')
        for img in imgs:
            srcs = img.get_attribute_list('src')
            links.extend(srcs)
        # .mp4 posts render as <video> elements with nested <source> tags
        sources = div.find_all('source')
        for s in sources:
            srcs = s.get_attribute_list('src')
            links.extend(srcs)
    links = rmdupe(links)  # Remove duplicates
    driver.execute_script('window.scrollBy(0, 750)')
    sleep(.2)
>>> len(links)
36  # Huzzah! Got all the album links!
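As an aside, on Python 3.7+ (where dicts preserve insertion order) the `rmdupe` helper can be sketched more simply with `dict.fromkeys`; this is an alternative I'm suggesting, not part of the original answer:

```python
# Order-preserving de-duplication via dict key uniqueness (Python 3.7+).
def rmdupe(seq):
    return list(dict.fromkeys(seq))

print(rmdupe(['a.jpg', 'b.jpg', 'a.jpg', 'c.mp4', 'b.jpg']))
# → ['a.jpg', 'b.jpg', 'c.mp4']
```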
Notes:
Creates a headless Chrome instance, so the code can be used in a script or potentially a larger project.
I used BeautifulSoup because it's a bit easier to work with, and I was having some issues finding elements and accessing their values with Selenium alone (likely due to inexperience).
Set the window size to be really "tall" so more image links are loaded at once.
Disabled images in the Chrome browser settings to stop the browser from actually downloading the images (all I need are the links).
Some links are .mp4 files and are rendered in HTML as `<video>` elements with `<source>` tags inside, which contain the link. The portion of code starting with `sources = div.find_all('source')` is there to make sure no album links are lost.
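For anyone who wants to see that extraction step in isolation without the BeautifulSoup dependency, here's a minimal sketch using the standard library's `html.parser` instead; the HTML fragment below is made up to mimic an image post and an .mp4 post, it is not Imgur's real markup:

```python
from html.parser import HTMLParser


class SrcCollector(HTMLParser):
    # Collects the src attribute from every <img> and <source> tag.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ('img', 'source'):
            src = dict(attrs).get('src')
            if src:
                self.links.append(src)


# Made-up fragment: one plain image post and one .mp4 post
fragment = """
<div class="image post-image"><img src="https://i.imgur.com/abc.jpg"></div>
<div class="image post-image">
  <video><source src="https://i.imgur.com/def.mp4"></video>
</div>
"""

parser = SrcCollector()
parser.feed(fragment)
print(parser.links)
# → ['https://i.imgur.com/abc.jpg', 'https://i.imgur.com/def.mp4']
```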