
So I ran a crawler last week and produced a CSV file that lists all the image URLs I need for my project. After reading the CSV into a Python list, I wasn't sure how to have Scrapy simply download them through a pipeline. I've tried many things and recently got it working, but it's ugly and not quite right: for my list of 10 image URLs, Scrapy finishes the scrape having made 20 requests, even though the 10 images were correctly stored. I'm probably doing something stupid because I'm fairly new to Scrapy, but I've read through most of Scrapy's documentation and done quite a bit of trial and error with Google results.

I simply want Scrapy to send one request per URL and download the corresponding image. Any help would be appreciated. I have banged my head against this for 3 days. My code:

spider.py

import scrapy
import csv
import itertools
from ..items import ImgItem

urls=[]
with open('E:/Chris/imgUrls.csv') as csvDataFile:
    csvReader = csv.reader(csvDataFile)
    for elem in itertools.islice(csvReader, 0, 10):
        urls.append(elem[0])                # just doing the first 10 for testing;
                                            # my CSV file is not the problem -- one URL per row

class DwImgSpider(scrapy.Spider):
    name = 'dw-img'
    start_urls = urls

    def parse(self, response):
        item = ImgItem()
        img_urls = urls
        item['image_urls'] = img_urls
        return item

If you want to see additional files, I can edit this post to add them. I just figured the spider is where the problem comes from, since everything else does technically work. Thanks again; I appreciate any help or pointers.
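
For reference, the item and settings are just the standard Scrapy ImagesPipeline setup, roughly like this (the IMAGES_STORE path below is a placeholder for my actual output folder):

items.py

import scrapy

class ImgItem(scrapy.Item):
    # fields used by scrapy.pipelines.images.ImagesPipeline
    image_urls = scrapy.Field()
    images = scrapy.Field()

settings.py (relevant lines)

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'E:/Chris/images'    # placeholder download folder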

Chris4542
  • What do you have in `start_urls`? If they are URLs to images, then maybe it first fetches them from the server because you put them in `start_urls` and it treats them as pages, and later the pipeline downloads each one again as an image. Frankly, I wouldn't use `Scrapy` for this but `requests` or `urllib.request` – furas Jun 28 '20 at 00:57
  • Yes, I've used Requests successfully for this, but it is far too slow for the number of URLs I'm going through, even after threading. So I figured I could either learn asyncio and use that, or use Scrapy, which integrates very nicely with my proxy manager, Crawlera. The problem with async is that I read it doesn't work with HTTPS proxies, but I haven't dug too deep into it. Anyway, start_urls is a list of direct .jpg URLs (https..website.../blahblah.jpg). I guess the question would be: is there a way to not include start_urls at all? Otherwise, yeah, Scrapy is not ideal for this – Chris4542 Jun 28 '20 at 02:01
  • I would put only one URL in `start_urls` (i.e. the main page, `https..website.../`), and you already add all of the images in `item['image_urls'] = urls` – furas Jun 28 '20 at 02:07
  • Oh wow. I’m going to try that right now. If that works I’m going to be excited but angry at myself. I’ll let you know – Chris4542 Jun 28 '20 at 02:09
  • BTW: there are the modules [requests-async](https://github.com/encode/requests-async) and [httpx](https://github.com/encode/httpx), but if you have access to Crawlera then Scrapy can be more useful :) – furas Jun 28 '20 at 02:18
  • 1
    Welp. That definitely fixed the number of requests problem. (Face palm). Thank you so much for that help sir. I’ll begin testing at scale. – Chris4542 Jun 28 '20 at 02:20

2 Answers


Thanks to furas, I found that changing

start_urls = urls

to

start_urls = ['<just one url, the main website>']

fixed my number-of-requests problem! Thank you, furas.
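
For anyone else who hits this, the working spider is just the original one with that single line changed (the main-site URL is still a placeholder here); Scrapy makes one request to that page, and the images pipeline then downloads everything listed in image_urls:

spider.py

import scrapy
import csv
import itertools
from ..items import ImgItem

urls = []
with open('E:/Chris/imgUrls.csv') as csvDataFile:
    csvReader = csv.reader(csvDataFile)
    for elem in itertools.islice(csvReader, 0, 10):
        urls.append(elem[0])

class DwImgSpider(scrapy.Spider):
    name = 'dw-img'
    # one throwaway request to the main site instead of one request per image
    start_urls = ['<just one url, the main website>']

    def parse(self, response):
        item = ImgItem()
        item['image_urls'] = urls   # the pipeline makes the 10 image requests
        return item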

Chris4542

Here is another method, using the `simplified_scrapy` package. It reads the same CSV and saves each response as an image file in a local `images/` folder.

import csv
import os
import itertools
from simplified_scrapy import Spider, SimplifiedMain, utils

class ImageSpider(Spider):
    name = 'images'
    start_urls = []

    def __init__(self):
        # read the first 10 image URLs from the CSV, one URL per row
        with open('E:/Chris/imgUrls.csv') as csvDataFile:
            csvReader = csv.reader(csvDataFile)
            for elem in itertools.islice(csvReader, 0, 10):
                self.start_urls.append(elem[0])
        Spider.__init__(self, self.name)  # necessary
        if not os.path.exists('images/'):
            os.mkdir('images/')

    def afterResponse(self, response, url, error=None, extra=None):
        # save each downloaded response into the images/ folder
        try:
            utils.saveResponseAsFile(response, 'images/', 'image')
        except Exception as err:
            print(err)
        return None

SimplifiedMain.startThread(ImageSpider())  # start the download
dabingsou