
I have multiple URLs to scrape stored in a csv file, where each row is a separate URL, and I'm using this code to run it:

    def start_requests(self):
        with open('csvfile', 'rb') as f:
            list = []
            for line in f.readlines():
                array = line.split(',')
                url = array[9]
                list.append(url)
            list.pop(0)
        for url in list:
            if url != "":
                yield scrapy.Request(url=url, callback=self.parse)

It gives me the following error: `IndexError: list index out of range`. Can anyone help me correct this or suggest another way to use that csv file?

Edit: the csv file looks like this:

http://example.org/page1
http://example.org/page2

There are 9 such rows.
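A minimal sketch that reproduces the error outside Scrapy, assuming one URL per line exactly as shown above (the sample line below is hypothetical):

    # Sketch: why array[9] fails when a row holds a single URL and no commas
    line = "http://example.org/page1\n"   # hypothetical line read from the csv file
    array = line.split(',')
    print(len(array))   # 1 -- only one column, so index 9 does not exist
    url = array[9]      # raises IndexError: list index out of range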

A67John
  • Would it be possible to share some of your csv file to help find what the issue is? `IndexError: list index out of range` most likely suggests that the cause is `url = array[9]` – Ryan Jul 20 '20 at 18:09
  • It is literally a csv file where each row is a URL, no extra signs, no separators, nothing, and there are 9 rows for test purposes – A67John Jul 20 '20 at 18:12
  • Edited the question to show the csv file – A67John Jul 20 '20 at 18:18

1 Answer


You should be able to do this by reading the csv file with the `csv` module, without needing most of the code above. There is then no need to split, pop, or append.

Working example

import csv
import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        with open('websites.csv') as csv_file:
            data = csv.reader(csv_file)
            for row in data:
                # Supposing that the data is in the first column
                url = row[0]
                if url != "":
                    # We need to check this has the http prefix or we get a Missing scheme error
                    if not url.startswith('http://') and not url.startswith('https://'):
                        url = 'https://' + url
                    yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Do my data extraction
        print("test")


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    })
    c.crawl(QuotesSpider)
    c.start()
Ryan
  • It almost works perfectly, since all the urls start with https it turned them into `https:// http// example .com/site1` (without spaces), but after getting rid of part of the prefix check it works fine, thank you – A67John Jul 20 '20 at 18:43
  • You are correct, my prefix check should be an `and`. I'll update it now. Anyway, glad it worked – Ryan Jul 20 '20 at 18:46
  • If it's not inconvenient to you, would you mind explaining, or pointing me towards a source that explains, why the last part is necessary? The `if __name__` part – A67John Jul 20 '20 at 18:54
  • That part is just used to run a spider as a single Python script and not via the `scrapy crawl` command. It is mentioned in the docs: https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script. The `if __name__ == "__main__"` part is general Python. You can find an explanation here: https://stackoverflow.com/questions/28336627/if-name-main-python – Ryan Jul 20 '20 at 18:59
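For comparison, a minimal sketch of the same spider without the script-runner part, assuming it is saved inside a regular Scrapy project (e.g. under a `spiders/` folder, a hypothetical layout) and started with `scrapy crawl quotes` rather than by running the file directly:

    import csv
    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            with open('websites.csv') as csv_file:
                for row in csv.reader(csv_file):
                    # Supposing that the data is in the first column
                    url = row[0]
                    if url != "":
                        yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Do my data extraction
            print("test")

Run it from the project directory with `scrapy crawl quotes`; the command sets up the crawler itself, so the `CrawlerProcess` and `if __name__ == "__main__"` block are only needed when the file is run as a standalone script.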