I have a small bug with Scrapy and Pillow. I know there are many similar questions, but I have tried everything I could find and nothing works.
I use Scrapy to parse many websites, more than 100,000 web pages. I have created a pipeline that checks whether a page contains an image and, if so, downloads the picture and creates a thumbnail at the same path. I download the file first so that, if creating the thumbnail fails, I still have the "big" version of the image.
Here is some code:
import os
import urllib

from PIL import Image
from scrapy.exceptions import DropItem
from slugify import slugify

# "settings" is the Scrapy project settings object (import omitted here).


class DownloadImageOnDisk(object):

    def process_item(self, item, spider):
        try:
            # If there is an image on the page
            if item['image']:
                img = item['image']
                # Get the extension of the image (drop any query string)
                ext = img.split('.')
                ext = ext[-1].split('?')
                ext = ext[0]
                key = self.remove_accents(item['imagetitle']).encode('utf-8', 'replace')
                path = settings['IMG_PATH'] + item['website'] + '/' + key + '.' + ext
                # Create the directory
                if not os.path.exists(settings['IMG_PATH'] + item['website']):
                    os.makedirs(settings['IMG_PATH'] + item['website'])
                # Check that the image does not already exist
                if not os.path.isfile(path):
                    # Download the big image
                    urllib.urlretrieve(img, path)
                    if os.path.isfile(path):
                        # Create the thumbnail
                        self.optimize_image(path)
                item['image'] = item['website'] + '/' + key + '.' + ext
            return item
        except Exception as exc:
            pass

    # Slugify the path
    def remove_accents(self, input_str):
        try:
            return slugify(input_str)
        except Exception as exc:
            raise DropItem(exc)

    # Create the thumbnail (overwrites the downloaded file)
    def optimize_image(self, path):
        try:
            image = Image.open(path)
            image.thumbnail((200, 200), Image.ANTIALIAS)
            image.save(path, optimize=True, quality=85)
        except IOError as exc:
            raise DropItem(exc)
        except Exception as exc:
            raise DropItem(exc)
But sometimes, not regularly (about one in 100 items, I think), I get this error:
cannot identify image file '/PATH/NAME.jpg'
It is raised in the optimize_image function. When I check on disk, the image file does exist.
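One check that might narrow this down is to look at what was actually written to disk, since a file can exist without being a readable image (for example an HTML error page or an empty file). Below is a minimal diagnostic sketch, assuming Python 3; the helper name `sniff_image_file` is hypothetical and not part of the pipeline above, and it only recognises a few common formats by their leading bytes.

```python
import os

# Leading "magic" bytes for a few common image formats.
_SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
]


def sniff_image_file(path):
    """Return (size_in_bytes, guessed_format_or_None) for the file at path.

    Hypothetical diagnostic helper: it only inspects the first bytes,
    so it cannot tell whether the rest of the file is complete.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        head = handle.read(16)
    for prefix, name in _SIGNATURES:
        if head.startswith(prefix):
            return size, name
    return size, None
```

Logging the result of `sniff_image_file(path)` for each failing path would at least show whether those files are empty or start with something other than an image header.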
I really don't understand why this happens.
If you have any suggestions, I would be glad to hear them.
Thanks in advance.