**Update**
===================================================

OK, so far so good. I have code that scrapes the images, but it stores them in a strange way: it downloads the first 40+ images, then creates another 'kittens' folder inside the previously created 'kittens' folder and starts over, downloading the same images again. How can I fix this? Here is the code:

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import requests
import time
import os

image_tags = []

driver = webdriver.Chrome()
driver.get(url='https://www.pexels.com/search/kittens/')
last_height = driver.execute_script('return document.body.scrollHeight')

while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(1)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

sp = soup(driver.page_source, 'html.parser')

for img_tag in sp.find_all('img'):
    image_tags.append(img_tag)


if not os.path.exists('kittens'):
    os.makedirs('kittens')

os.chdir('kittens')

x = 0

for image in image_tags:
    try:
        url = image['src']
        source = requests.get(url)
        with open('kitten-{}.jpg'.format(x), 'wb') as f:
            f.write(source.content)  # reuse the response fetched above instead of downloading twice
            x += 1
    except (KeyError, requests.exceptions.RequestException):
        pass  # skip <img> tags without a src and failed downloads
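
A likely cause of the nested folder (an editor's guess, not confirmed by the thread): `os.chdir('kittens')` changes the working directory, so if the script runs again while the working directory is already `kittens`, `os.makedirs('kittens')` creates `kittens/kittens` and the downloads start over inside it. A minimal sketch of one way to avoid this, building each file path with `os.path.join` instead of changing the working directory (assumes `image_tags` is the list collected above):

import os
import requests

save_dir = os.path.abspath('kittens')  # resolve the target folder once, up front
os.makedirs(save_dir, exist_ok=True)   # no error if the folder already exists

for x, image in enumerate(image_tags):
    url = image.get('src')             # returns None instead of raising when src is missing
    if not url:
        continue
    response = requests.get(url)
    if response.status_code == 200:    # only save successful responses
        with open(os.path.join(save_dir, 'kitten-{}.jpg'.format(x)), 'wb') as f:
            f.write(response.content)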

===========================================================================

I'm trying to write a spider to scrape images of kittens from some page. I've got a small problem: my spider only gets the first 15 images. I know it's probably because the page loads more images as you scroll down. How can I resolve this issue? Here is the code:

import requests
from bs4 import BeautifulSoup as bs
import os


url = 'https://www.pexels.com/search/cute%20kittens/'

page = requests.get(url)
soup = bs(page.text, 'html.parser')

image_tags = soup.findAll('img')

if not os.path.exists('kittens'):
    os.makedirs('kittens')

os.chdir('kittens')

x = 0

for image in image_tags:
    try:
        url = image['src']
        source = requests.get(url)
        if source.status_code == 200:
            with open('kitten-' + str(x) + '.jpg', 'wb') as f:
                f.write(requests.get(url).content)
                f.close()
                x += 1
    except:
        pass
  • "scrolling down" is not something `requests` can do. you can use [selenium to scroll down on the page](https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python/27760083) (an automated browser), and get the links that way. [More info on Selenium](http://selenium-python.readthedocs.io/installation.html) – Sean Breckenridge Mar 03 '18 at 20:35
  • 1
    @piotrulu, this is not an answer, but a few suggestions for better code writing. 1. Instead of `'kitten-'+str(x)+'.jpg'` use `'kitten-{}.jpg'.format(x)`. 2. When you use `with open(...):`, the `close()` function gets called implicitly when you leave the indented block. So, you don't need to write that explicitly. – Keyur Potdar Mar 04 '18 at 05:34

1 Answer


Since the site is dynamic, you need to use a browser-automation tool such as Selenium:

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import time
import os

driver = webdriver.Chrome()
driver.get('https://www.pexels.com/search/cute%20kittens/')

# scroll until the page height stops growing, i.e. no more images are lazy-loaded
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(0.5)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# skip <img> tags without a src so the comprehension cannot raise KeyError
image_urls = [i['src'] for i in soup(driver.page_source, 'html.parser').find_all('img') if i.get('src')]

if not os.path.exists('kittens'):
    os.makedirs('kittens')
os.chdir('kittens')

with open('kittens.txt', 'w') as f:  # 'w' mode is required, or f.write would fail
    for url in image_urls:
        f.write('{}\n'.format(url))
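
To actually download the images rather than just record their URLs (which is what the comments below ask about), the `requests` loop from the question can be reused on `image_urls`. A minimal sketch, assuming the scraped `src` values are absolute URLs and the working directory is still `kittens`:

import requests

for x, url in enumerate(image_urls):
    response = requests.get(url)
    if response.status_code == 200:  # only save successful responses
        with open('kitten-{}.jpg'.format(x), 'wb') as f:
            f.write(response.content)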
  • And in what directory does this script write the images? – piotrulu Mar 03 '18 at 20:42
  • It doesn't; currently the URLs for the images are just saved in a list called `image_urls`. – Sean Breckenridge Mar 03 '18 at 20:47
  • @piotrulu this script only demonstrates how to retrieve the images. You can use your current code to write the strings in `image_urls` to a file. – Ajax1234 Mar 03 '18 at 20:49
  • Still don't get it. Should I add this code somewhere in my existing code and that's it? Or should I change this `requests` bit? I'm really new to programming and I'm a little overwhelmed by all of this. – piotrulu Mar 03 '18 at 21:38
  • @piotrulu please see my recent edit. I added the necessary code to write the URLs to a file. – Ajax1234 Mar 03 '18 at 21:41
  • @Ajax1234 your code only opens a new Chrome window and does nothing more. Is it possible that my IDE isn't working properly? I use PyCharm. – piotrulu Mar 04 '18 at 17:58
  • @piotrulu you may have to install the proper bindings for `selenium`. Are you including the path to the driver when creating the browser object? i.e. `webdriver.Chrome('path/to/driver')` – Ajax1234 Mar 04 '18 at 18:02
  • @Ajax1234 yes I do – piotrulu Mar 04 '18 at 18:10
  • @piotrulu is the website loading in the Chrome window? – Ajax1234 Mar 04 '18 at 18:12
  • @Ajax1234 a new Chrome window opens up, but it's empty and the address bar shows only 'data:' and nothing more :/ – piotrulu Mar 04 '18 at 18:14