This is a bit of a long theoretical question about how img
tags really work, for the purposes of web scraping. I've done a lot of research and have seen a bunch of working solutions, but I haven't felt that the core question was answered.
First off, my task: I wish to efficiently scrape ~100k HTML pages from a website and also download images on these pages, while respecting their robots.txt
crawl rate of 3 seconds per page.
First, I built a scraper intending to just crawl the HTML and get a long list of image urls to download on a second pass. But then, I realized that, with ~10 images per page this would be ~1M images. At a 3-second crawl rate, this would take ~30 days.
So I thought: "if I'm scraping using Selenium, the images are getting downloaded anyway! I can just download the images on page-scrape."
Now, my background research: This sent me down a rabbit hole, and I learned that the following options exist to download images on a page without making additional calls:
- You can right-click and "Save as" (SO post)
- You can screenshot the image (SO post)
- Sometimes, weirdly, the image data is loaded into
src
anyway (SO post) - Selenium Wire exists, which is really the best way to address this. (SO Post)
These all seem like viable answers, but on a deeper level, they all (even Selenium Wire**) seem like hacks.
** Selenium Wire allows you to access the data in the requests made by Selenium. This is great, but I naively assumed that when a page is rendered and the images are placed in the img
tags, they're in the page and I shouldn't have to worry about the requests that retrieved them.
Now, finally, my question. I clearly have a fundamental misunderstanding about how the img
tag works.
Why can't one directly access image data through the Selenium driver, which is loading and rendering the images anyway? The images are there; I see the images when the driver loads. Theoretically, naively, I would expect to be able to download whatever is loaded on the page.
The one parallel I know of is with iframes -- you can visually see the content of the iframe, but you can only scrape it after directing Selenium to switch into the frame (background). So naively I assumed there would be a switch
method for img
's as well. The fact that there isn't, and it's not clear how to use Selenium to download the image data, tells me that I'm not really understanding how a browser handles an img
tag.
I understand all the hacks and the workarounds, but my question here is why?