25

I am trying to extract and download all images from a URL. I wrote this script:

import urllib2
import re
from os.path import basename
from urlparse import urlsplit

url = "http://filmygyan.in/katrina-kaifs-top-10-cutest-pics-gallery/"
urlContent = urllib2.urlopen(url).read()
# HTML image tag: <img src="url" alt="some_text"/>
imgUrls = re.findall('img .*?src="(.*?)"', urlContent)

# download all images
for imgUrl in imgUrls:
    try:
        imgData = urllib2.urlopen(imgUrl).read()
        # urlsplit(imgUrl)[2] is the URL's path; basename gives the file name
        fileName = basename(urlsplit(imgUrl)[2])
        output = open(fileName, 'wb')
        output.write(imgData)
        output.close()
    except:
        pass

I don't want to extract just the images on this page (see this screenshot: http://i.share.pho.to/1c9884b1_l.jpeg); I want to get all the images of the slideshow without clicking the "Next" button. How can I get all the pics within the "Next" class? What changes should I make in findall?

user2711817
  • You'd like to use BeautifulSoup but are unsure how to proceed? – Jon Clements Aug 23 '13 at 18:16
  • Yes. I am not sure how I should use findall or findnext. The above script will grab all the images of that URL, but what I want (see the image link) is to grab all the images of that slideshow, which appear after clicking the Next button. – user2711817 Aug 24 '13 at 20:37
  • Use [wget](http://stackoverflow.com/questions/4602153/how-do-i-use-wget-to-download-all-images-into-a-single-folder) – Burhan Khalid Aug 30 '13 at 21:29
  • Tell me one thing: why do you want to download images from filmygyan? Then I can give you the solution to your query! – Khan Aug 30 '13 at 21:21
  • @Khan Nothing special. I am just learning. – user2711817 Aug 31 '13 at 14:35

4 Answers

45

The following should extract all the images from a given page and write them to the directory where the script is run:

import re
import requests
from bs4 import BeautifulSoup

site = 'http://pixabay.com'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]


for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
        print("Regex didn't match with the url: {}".format(url))
        continue
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative;
            # if it is, prepend the base URL (which here is
            # just the site variable)
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
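
A more robust way to resolve relative (and protocol-relative) image sources, if you prefer, is urllib.parse.urljoin; a minimal sketch of the same loop, assuming Python 3 and the site variable from above:

from urllib.parse import urljoin

for url in urls:
    # urljoin leaves absolute URLs alone, resolves relative paths
    # against the site, and handles protocol-relative '//host/...' sources
    full_url = urljoin(site, url)
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', full_url)
    if not filename:
        continue
    with open(filename.group(1), 'wb') as f:
        f.write(requests.get(full_url).content)
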
Jonathan
  • Is it saving images in a folder? – Shubham Sharma May 11 '18 at 11:47
  • 'NoneType' object has no attribute 'group' – Mostafa Sep 25 '19 at 07:22
  • To reply to you, Mostafa: I added a try/except statement, and that seemed to resolve the issue, at least for me. I am still unable to get Windows Media Viewer to see the images, though... – Henri De Boever Apr 24 '20 at 19:13
  • Well, 'NoneType' object has no attribute 'group' just means that no match was made by the regex. I made an amendment that prints out the URL that didn't match. – Jonathan Apr 25 '20 at 08:59
  • Hello Jonathan, thank you for the update to the code to clear that up. Is there any reason that the images cannot be accessed after they have been downloaded? – Henri De Boever Apr 28 '20 at 18:41
  • There is probably a reason for that, but I can't really tell without seeing the file or the website where you tried it out. – Jonathan Apr 29 '20 at 07:24
4

Slight modification to Jonathan's answer (because I can't comment): adding 'www' to the website will fix most "File Type Not Supported" errors.

import re
import requests
from bs4 import BeautifulSoup

site = 'http://www.google.com'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]


for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
        print("Regex didn't match with the url: {}".format(url))
        continue
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative;
            # if it is, prepend the base URL (which here is
            # just the site variable)
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
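
If you still hit "File Type Not Supported" errors, another common cause is saving a non-image response (an error page, say) under an image filename. A small guard you could add before writing, sketched against the same loop (the Content-Type check is standard requests usage; untested against any particular site):

for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
        continue
    response = requests.get(url)
    # only write the file if the server actually returned an image
    if response.headers.get('Content-Type', '').startswith('image/'):
        with open(filename.group(1), 'wb') as f:
            f.write(response.content)
    else:
        print("Skipping non-image response: {}".format(url))
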
Mac
0
from bs4 import BeautifulSoup
import requests
import os


def folder_create(images):
    try:
        folder_name = input("Enter Folder Name:- ")
        # folder creation
        os.mkdir(folder_name)
    except FileExistsError:
        print("A folder already exists with that name!")
        # ask again, passing the images along this time
        return folder_create(images)

    download_images(images, folder_name)


def download_images(images, folder_name):
    count = 0
    print(f"Total {len(images)} Images Found!")
    for i, image in enumerate(images):
        # an <img> tag may carry its URL in any of these attributes,
        # so try them in order of preference
        image_link = None
        for attr in ("data-srcset", "data-src", "data-fallback-src", "src"):
            if image.get(attr):
                image_link = image[attr]
                break
        if image_link is None:
            continue

        try:
            r = requests.get(image_link).content
            try:
                # if the body decodes as UTF-8 text it is probably an
                # error page rather than binary image data, so skip it
                r = str(r, 'utf-8')
            except UnicodeDecodeError:
                with open(f"{folder_name}/images{i + 1}.jpg", "wb+") as f:
                    f.write(r)
                count += 1
        except requests.RequestException:
            pass

    if count == len(images):
        print("All Images Downloaded!")
    else:
        print(f"Total {count} Images Downloaded Out of {len(images)}")


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    images = soup.findAll('img')
    folder_create(images)


url = input("Enter URL:- ")
main(url)
Nuhman Pk
-8

If you only want the pictures, then you can just download them without even scraping the webpage. They all have the same URL pattern:

http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute1.jpg
http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute2.jpg
...
http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute10.jpg

So code as simple as this will give you all the images:

import os
import urllib


baseUrl = "http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-"\
      "cutest-pics-gallery/cute%s.jpg"

for i in range(1,11):
    url = baseUrl % i
    urllib.urlretrieve(url, os.path.basename(url))

With BeautifulSoup you will have to click or go to the next page to scrape the images. If you want to scrape each page individually, try scraping them using their class, which is shutterset_katrina-kaifs-top-10-cutest-pics-gallery; a sketch of that follows.
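
A minimal sketch of that class-based approach, assuming the gallery wraps each thumbnail in an <a> tag carrying that class, with href pointing at the full-size image (unverified against the live page):

import os
import urllib
import urllib2

from bs4 import BeautifulSoup

url = "http://filmygyan.in/katrina-kaifs-top-10-cutest-pics-gallery/"
soup = BeautifulSoup(urllib2.urlopen(url).read())
gallery_links = soup.find_all(
    'a', class_='shutterset_katrina-kaifs-top-10-cutest-pics-gallery')
for a in gallery_links:
    imgUrl = a['href']
    urllib.urlretrieve(imgUrl, os.path.basename(imgUrl))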

4d4c
  • But your script will not work in this case. See, if the URL is http://filmygyan.in/tamannah-bhatia-spotted-sizzling-hot-at-tv-channel-launch/, the filenames jump randomly between sexy112.jpg, sexy117.jpg, and sexy12.jpg, so if I range it from (1,117) it will also download garbage values (a probing workaround is sketched after this thread). – user2711817 Aug 25 '13 at 10:45
  • So you are using a different URL? That's a completely different question. If you need to get all images from the new URL, open another question. If you want to make a script that will work for all pages on your site, then you will have to supply your **NEW** question with all required information (like what classes, ids, or tags are used on each page). – 4d4c Aug 26 '13 at 20:51
  • Okay. I thought this script was going to work for all URLs because I checked it on some URLs, but after 2 or 3 URLs I got stuck because this time the URLs were not following a pattern like (1,12) or (1,20). Looks like I have to post another question to get all the images from any URL. – user2711817 Aug 27 '13 at 13:32
  • Yes, you do. But do you know how many URLs you will have from which you want to download images? I think there is a pattern with which you can make a script that will work for all pages from those URLs. – 4d4c Aug 27 '13 at 13:34
  • Yes, I am trying to figure out this pattern. Maybe I should look for the "div" in which all the images are contained. – user2711817 Aug 27 '13 at 14:07
  • Give all the URLs from which you want to gather photos. Then I can help you. – 4d4c Aug 27 '13 at 14:16
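
For the irregular numbering discussed in the thread above, one hedged workaround (the gallery path below is hypothetical; substitute the real one) is to probe each candidate URL and keep only the files the server actually has:

import os
import urllib

# hypothetical pattern -- substitute the real gallery path
baseUrl = "http://filmygyan.in/wp-content/gallery/some-gallery/sexy%s.jpg"

for i in range(1, 118):
    url = baseUrl % i
    # probe first; skip numbers that do not exist on the server
    # (urllib.urlopen does not raise on HTTP errors, so check the code)
    if urllib.urlopen(url).getcode() == 200:
        urllib.urlretrieve(url, os.path.basename(url))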