Simple Python Image Scraper Script

Question

It's fairly simple stuff here... I'm currently experimenting with Python and have very little experience. I wanted to create an image scraper that goes to a page, downloads the image, clicks the link to the next page, downloads the next image, and so on (as a source I'm using a website similar to 9gag). Right now my script can only print the image URL and the next-link URL, and I can't figure out how to make my bot click the link, download the next image, and keep doing that indefinitely (until some condition is met, it's stopped, etc.).

PS: I'm using beautifulsoup4 (I think, LOL).

Thanks in advance, Zil

Here is what the script looks like now. I was kind of combining a couple of scripts into one, so it looks very unclean...

import requests
from bs4 import BeautifulSoup
import urllib

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url2 = 'http://linksmiau.net/linksmi_paveiksliukai/rimtas_rudeninis_ispejimas_merginoms/1819/'
        url = url2
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")

        for img in soup.findAll('img', {'class': 'img'}):
            temp = img.get('src')
            if temp[:1]=="/":
                image = "http://linksmiau.net" + temp
            else:
                image = temp

        print(image)


        for lnk in soup.findAll('div', {'id': 'arrow_right'}):
                nextlink = lnk.get('onclick')
                link = nextlink.replace("window.location = '", "")
                lastlink = "http://linksmiau.net" + link
                page += 1
        print(lastlink)
        url2 == lastlink

trade_spider(3)

score 1 · Answer 1 · edited May 23 '17 at 11:54

I wouldn't think of it in terms of "clicking" a link, since you're writing a script, and not using a browser.

What you need is to figure out 4 things:

Given a URL, how do you get the HTML behind it and parse it with BeautifulSoup? It sounds like you've got this part down already. :)
Given many different HTML pages, how do you identify the images you want to download and the "next" link? Once again, BeautifulSoup.
Given the URL of an image (found in the "src" attribute of <img> tags), how do you save the image to disk? Answers can be found in StackOverflow questions like this one: Downloading a picture via urllib and python
Given the URL of a "next" link, how do you "click" on it? Once again, you're not really "clicking"; you just download the HTML from this new link and start the entire cycle again (parse it, identify the image and the "next" link, download the image, fetch the HTML behind the "next" link).
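
The last two steps can be sketched with the standard library alone. This is only a rough illustration, not a definitive implementation: the function names save_image and next_page_url are made up here, the file name is assumed to be the last path segment of the image URL, and the "next" link is assumed to sit in an onclick value of the form window.location = '...', as in the question's script.

```python
import os
import re
import shutil
import urllib.request
from urllib.parse import urljoin, urlsplit


def save_image(image_url, dest_dir="."):
    """Download image_url into dest_dir and return the local file path."""
    # Assumption: the last path segment of the URL is a usable file name.
    filename = os.path.basename(urlsplit(image_url).path)
    if not filename:
        raise ValueError("cannot derive a file name from " + image_url)
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, filename)
    # Stream the response body straight into the file.
    with urllib.request.urlopen(image_url) as response, open(local_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return local_path


def next_page_url(onclick, page_url):
    """Turn an onclick value like "window.location = '/a/b/'" into an absolute URL."""
    # Take whatever sits between the first pair of quotes; give up with None
    # if the attribute is missing or does not have that shape.
    match = re.search(r"""['"]([^'"]+)['"]""", onclick or "")
    if match is None:
        return None
    # urljoin resolves a relative path against the page it was found on.
    return urljoin(page_url, match.group(1))
```

"Clicking" then just means passing the result of next_page_url to requests.get and parsing the new page the same way as the first one.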

Once you've broken the problem down, all that's left is to assemble everything in one nice script, and you're done.

Good luck :)

Hey, thanks for your reply. I think I'm OK with the first 3 points you mentioned above, and I'm sure the 4th step is the problem here... I don't think I'll be able to sort it out myself, or at least it's going to take too much time... I have updated the original post with my current script; if you have time, could you tell me what's wrong? - Zia Lvinas, Feb 17 '16 at 12:21
Why not use "requests.get(url)" on the next link? If you move the first line inside your while loop (the url2 assignment) to just before the loop, it might work (right now, each iteration of the while loop parses the same URL). - DougieHauser, Feb 17 '16 at 12:47

score 1 · Answer 2 · answered Feb 17 '16 at 13:32

It's fixed. DougieHauser was right and I want to shake his hand for that.

I just moved the url2 line outside of the while loop and it seems to work just fine. Now all I need is to figure out how to make this script save pictures to my HDD, LOL.

def trade_spider(max_pages):
    url2 = 'http://linksmiau.net/linksmi_paveiksliukai/rimtas_rudeninis_ispejimas_merginoms/1819/'
    page = 1
    while page <= max_pages:
        # url2 = 'http://linksmiau.net/linksmi_paveiksliukai/rimtas_rudeninis_ispejimas_merginoms/1819/'  (moved above the loop)
        url = url2
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        #current_bet_id = "event_odd_id_31362885" #+ str(5)

        #for link in soup.findAll('span', {'class': 'game'}, itemprop="name"):
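
The snippet above is cut off partway through the loop. As a rough sketch of how a finished loop might look, here is one possible version. It assumes the page layout from the question (images in <img class="img">, the next link in the onclick of <div id="arrow_right"> written as window.location = '...'). The names crawl and get_html are made up for illustration; get_html is any function that takes a URL and returns the page's HTML as a string. Note that the next URL is stored with a plain assignment (=); the line url2 == lastlink in the question only compares the two values and changes nothing.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def crawl(start_url, max_pages, get_html):
    """Yield (page_url, image_urls) for up to max_pages pages, following the "next" arrow."""
    url = start_url  # set once, before the loop
    for _ in range(max_pages):
        soup = BeautifulSoup(get_html(url), "html.parser")

        # Collect absolute URLs of every <img class="img"> on the page.
        images = [urljoin(url, img["src"])
                  for img in soup.find_all("img", {"class": "img"})
                  if img.get("src")]
        yield url, images

        # The "next" arrow keeps its target inside an onclick attribute.
        arrow = soup.find("div", {"id": "arrow_right"})
        onclick = arrow.get("onclick", "") if arrow else ""
        match = re.search(r"""['"]([^'"]+)['"]""", onclick)
        if match is None:
            break  # no next page found: stop early
        url = urljoin(url, match.group(1))  # assignment (=), not comparison (==)
```

With requests installed, get_html could be something like lambda u: requests.get(u).text. Each image URL the loop yields still has to be written to disk separately, for example with urllib.request.urlretrieve.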
