I have a small bug with Scrapy and Pillow. I know there are many similar questions, but I've tried everything I could find and nothing works.

I use Scrapy to parse many websites, more than 100,000 web pages. I've created a pipeline that checks whether a page contains an image and, if so, downloads the picture and creates a thumbnail at the same path. I do it this way so that if thumbnail creation fails, I still have the "big" version of the image.

Here is the code:

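import os
import urllib

from PIL import Image
from scrapy.exceptions import DropItem
from slugify import slugify

# `settings` is the Scrapy project settings object; its import was not shown in the original post.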

class DownloadImageOnDisk( object ):
    def process_item( self, item, spider ):
        try:
            # If image on page
            if item[ 'image' ]:
                img     = item[ 'image' ]
                # Get extension of image
                ext     = img.split( '.' )
                ext     = ext[ -1 ].split('?')
                ext     = ext[0]
                key     = self.remove_accents( item[ 'imagetitle' ] ).encode( 'utf-8', 'replace' )
                path    = settings[ 'IMG_PATH' ] + item[ 'website' ] + '/' + key + '.' + ext

                # Create dir
                if not os.path.exists( settings['IMG_PATH'] + item['website'] ):
                    os.makedirs( settings[ 'IMG_PATH' ] + item[ 'website' ] )

                # Check if image not already exist
                if not os.path.isfile( path ):
                    # Download big image
                    urllib.urlretrieve( img, path )
                    if os.path.isfile( path ):
                        # Create thumb
                        self.optimize_image( path )

                item[ 'image' ] = item[ 'website' ] + '/' + key + '.' + ext

            return item
        except Exception as exc:
            pass

    # Slugify path
    def remove_accents( self, input_str ):
        try:
            return slugify( input_str )
        except Exception as exc:
            raise DropItem( exc )

    # Create thumb
    def optimize_image( self, path ):
        try:
            image = Image.open( path )
            image.thumbnail( ( 200,200 ), Image.ANTIALIAS )
            image.save( path, optimize=True, quality=85 )
        except IOError  as exc:
            raise DropItem( exc )
        except Exception as exc:
            raise DropItem( exc )

But sometimes, not regularly (about one in 100 items, I think), I get this error:

cannot identify image file '/PATH/NAME.jpg'

It is raised in the optimize_image function. When I check on disk, the image file does exist.
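One way to narrow this down is to look at the first bytes of a file that Pillow rejects. This is only a diagnostic sketch, and it rests on an unconfirmed assumption: that the file on disk is sometimes not image data at all (for example an HTML error page, or an empty or truncated download). The helper name `sniff_image_type` and the table `MAGIC_PREFIXES` are hypothetical, not part of the pipeline above.

```python
# Leading bytes ("magic numbers") of the formats mentioned in this question.
MAGIC_PREFIXES = {
    'jpeg': (b'\xff\xd8\xff',),
    'png': (b'\x89PNG\r\n\x1a\n',),
    'gif': (b'GIF87a', b'GIF89a'),
}


def sniff_image_type(path):
    """Return 'jpeg', 'png' or 'gif' if the file starts like one, else None."""
    with open(path, 'rb') as fd:
        head = fd.read(16)
    for name, prefixes in MAGIC_PREFIXES.items():
        # bytes.startswith accepts a tuple of candidate prefixes
        if head.startswith(prefixes):
            return name
    return None
```

Calling `sniff_image_type( path )` just before `Image.open( path )` and logging the cases where it returns None would show whether the failing files really contain image data.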

I really don't understand why.

If you have any suggestions, I'd be glad to hear them.

Thanks in advance

  • Did you see this one? http://stackoverflow.com/questions/19230991/image-open-cannot-identify-image-file-python – selllikesybok May 08 '15 at 01:39
  • Yes, I already have `from PIL import Image`, and I have all the decoders for PIL (JPEG, PNG, GIF, etc.). I've also tested with `io.BytesIO( fd.read() )`, but that doesn't work either. And when I run `pip freeze | grep -E '(Pillow|PIL)'` in the console, I only get _Pillow==2.8.1_ – magexcustomer May 08 '15 at 01:45

1 Answer

I'm not sure, but it seems to be resolved by downloading the image with requests and opening it from memory:

import requests
import io
...
response = requests.get( img )
image = Image.open(io.BytesIO(response.content))
image.thumbnail( ( 200,200 ), Image.ANTIALIAS )
image.save( path, optimize=True, quality=85 )

I'm continuing my tests.
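The in-memory approach above could be wrapped in a small helper so that a bad download is reported instead of raising. This is a minimal sketch rather than a confirmed fix; the function name `thumbnail_from_bytes` and its return convention are assumptions. It uses `Image.LANCZOS`, the same filter as `Image.ANTIALIAS`, because recent Pillow releases no longer provide the `ANTIALIAS` name.

```python
import io

from PIL import Image


def thumbnail_from_bytes(data, path, size=(200, 200)):
    """Write a thumbnail of `data` to `path`.

    Returns False if Pillow cannot read the bytes or cannot write the file.
    """
    try:
        image = Image.open(io.BytesIO(data))
        # thumbnail() resizes in place and keeps the aspect ratio
        image.thumbnail(size, Image.LANCZOS)
        image.save(path, optimize=True, quality=85)
    except (IOError, OSError):
        # Covers "cannot identify image file" as well as write failures
        return False
    return True
```

With requests, the call would be `thumbnail_from_bytes( response.content, path )`. Checking `response.status_code` first would also rule out thumbnails being attempted on error pages.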