
This is a bit of a long theoretical question about how img tags really work, for the purposes of web scraping. I've done a lot of research and have seen a bunch of working solutions, but I haven't felt that the core question was answered.


First off, my task: I wish to efficiently scrape ~100k HTML pages from a website and also download the images on these pages, while respecting the site's `robots.txt` crawl delay of 3 seconds per request.

First, I built a scraper intending to just crawl the HTML and get a long list of image URLs to download on a second pass. But then I realized that, at ~10 images per page, this would be ~1M images. At a 3-second crawl rate, that second pass alone would be ~3M seconds, i.e. roughly 35 days.

So I thought: "if I'm scraping using Selenium, the images are getting downloaded anyway! I can just download the images on page-scrape."


Now, my background research: This sent me down a rabbit hole, and I learned that the following options exist to download images on a page without making additional calls:

  • You can right-click and "Save as" (SO post)
  • You can screenshot the image (SO post; sketched below)
  • Sometimes, weirdly, the image data is already inlined in `src` as a data URI (SO post; sketched below)
  • Selenium Wire exists, which is really the best way to address this (SO post; sketched below)
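
For concreteness, here is a minimal sketch of the second and third options, assuming Python + Selenium 4. The URL is a placeholder, and note that screenshotting re-encodes the rendered pixels rather than recovering the original file bytes:

```python
# Minimal sketch of the screenshot and data-URI options above.
# The URL is a placeholder; writing everything as .png is a simplification.
import base64

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/some-article")  # placeholder URL

for i, img in enumerate(driver.find_elements(By.TAG_NAME, "img")):
    src = img.get_attribute("src") or ""
    if src.startswith("data:image"):   # data URI already inlined in src
        _, b64 = src.split(",", 1)
        data = base64.b64decode(b64)
    else:                              # screenshot the rendered element
        data = img.screenshot_as_png   # re-encoded pixels, not the original file
    with open(f"img_{i}.png", "wb") as f:
        f.write(data)

driver.quit()
```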

These all seem like viable answers, but on a deeper level, they all (even Selenium Wire**) seem like hacks.

** Selenium Wire allows you to access the data in the requests made by Selenium. This is great, but I naively assumed that when a page is rendered and the images are placed in the img tags, they're in the page and I shouldn't have to worry about the requests that retrieved them.
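
For reference, a Selenium Wire version might look roughly like this — a sketch assuming `pip install selenium-wire` and its `driver.requests` API, with a placeholder URL:

```python
# Minimal Selenium Wire sketch: driver.requests records every request the
# browser made, including response bodies.
from seleniumwire import webdriver  # note: seleniumwire, not selenium

driver = webdriver.Chrome()
driver.get("https://example.com/some-article")  # placeholder URL

for i, request in enumerate(driver.requests):
    response = request.response
    if response and response.headers.get("Content-Type", "").startswith("image/"):
        with open(f"img_{i}", "wb") as f:
            # The exact bytes the browser received (may need decoding
            # if a Content-Encoding header is set).
            f.write(response.body)

driver.quit()
```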


Now, finally, my question. I clearly have a fundamental misunderstanding about how the img tag works.

Why can't one directly access image data through the Selenium driver, which is loading and rendering the images anyway? The images are there; I see the images when the driver loads. Theoretically, naively, I would expect to be able to download whatever is loaded on the page.

The one parallel I know of is with iframes -- you can visually see the content of the iframe, but you can only scrape it after directing Selenium to switch into the frame (background; sketched below). So naively I assumed there would be a switch method for img elements as well. The fact that there isn't, and that it's not clear how to use Selenium to download the image data, tells me that I'm not really understanding how a browser handles an img tag.
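
The iframe switch I'm referring to is the standard `driver.switch_to.frame` step (minimal sketch; the URL is a placeholder):

```python
# Selenium must be pointed into the frame before its DOM becomes scrapeable.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/page-with-iframe")  # placeholder URL

frame = driver.find_element(By.TAG_NAME, "iframe")
driver.switch_to.frame(frame)          # now queries resolve inside the frame
inner_text = driver.find_element(By.TAG_NAME, "body").text
driver.switch_to.default_content()     # back to the top-level document
driver.quit()
```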

I understand all the hacks and the workarounds, but my question here is why?

Alex Spangher
  • The images are already downloaded by the browser so that it can display them. Just check the tmp directory. For instance, the Windows version of Chrome would use %LOCALAPPDATA%\Google\Chrome\User Data\Default\Cache – pcalkins Feb 09 '21 at 23:57
  • further... the way the WWW works is via links to files... when you use a relative path like "images/funny.gif", the web browser knows to download from "http://current_domain.com/images/funny.gif". It's relative to the current domain. It can also be absolute, like your avatar (in that case the server is sending the file based on some query vars...). It's all URLs... (Uniform Resource Locators) – pcalkins Feb 10 '21 at 00:05
    The Selenium webdriver only sends/receives commands/data to/from the browser. The browser is what is downloading the content. There's no need to "download" anything, since the browser has already done that. The webdriver communicates directly with the browser via the wire protocol: https://www.w3.org/TR/webdriver/ – pcalkins Feb 10 '21 at 00:11
  • thanks so much @pcalkins, this makes a ton of sense. I see that you can set a tmp path for the browser cache dir using `options = webdriver.ChromeOptions(); options.add_argument("user-data-dir=tmp")`. I'll try to come up with a generic solution for finding images in the cache, but if you have one first I'll accept your answer. Thanks for the help! It's a lot clearer to me now – Alex Spangher Feb 10 '21 at 00:15
  • You should be aware that Selenium is designed as a testing tool. The browser is your scraper. By default, the browser will clean up all that data when the webdriver is quit. (It assumes you no longer want to keep that info and would want a "fresh state" for the next driver session.) cURL may be a better option for you: https://www.php.net/manual/en/book.curl.php – pcalkins Feb 10 '21 at 00:26
  • There are scraping cases -- e.g. sites with lots of JavaScript, sites that have dynamic layouts, sites where you run a search against a database -- where you need a web browser instead of just cURL to render the page. – Alex Spangher Feb 10 '21 at 00:35
  • Follow-up question -- do you (or anyone) know if img requests count towards a site's `robots.txt`-specified crawl rate? – Alex Spangher Feb 10 '21 at 01:06
  • are you reading the robots.txt file? – pcalkins Feb 10 '21 at 17:25
  • Yes, it says `Friendly, low-speed bots are welcome viewing article pages`, and `# wget in recursive mode uses too many resources for us. Please wait 3 seconds between each request.`... I assume "recursive mode" refers to queries made to collect images, etc? – Alex Spangher Feb 10 '21 at 19:53
  • Recursive mode generally refers to a crawl that navigates to every link on a site or traverses every directory in a "tree"... you won't need to worry about that. I would just make sure that you include 3-second sleeps between each driver.get(), driver.navigate(), or click() navigation command. Congratulations for being a good internet citizen! Most people don't bother to read the robots.txt file. – pcalkins Feb 10 '21 at 20:55
  • Thanks -- I've been hacking at it, and it's a pity, but it seems too difficult to manipulate the local browser cache; besides, that might change with different versions of Chrome/other browsers, so it might not be worthwhile scoping out as a general solution. It certainly would have been elegant, though. – Alex Spangher Feb 11 '21 at 01:44
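
For reference, the cache-directory setup discussed in the comments looks roughly like this. A sketch only: the profile path is a placeholder, and parsing Chrome's cache format is version-dependent, which is why the approach was abandoned above:

```python
# Point Chrome at a known profile directory so its HTTP cache lands
# somewhere predictable. Only the setup is shown, not the cache parsing.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("user-data-dir=/tmp/scrape-profile")  # placeholder path
driver = webdriver.Chrome(options=options)
driver.get("https://example.com/some-article")  # placeholder URL
# Cached responses now live under /tmp/scrape-profile/Default/Cache;
# with an explicit user-data-dir, the profile persists after quit().
driver.quit()
```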

0 Answers