
I believe this is a simple one, and I'm willing to learn more. I want to crawl website titles from a list of URLs. The purpose is to predict online news popularity, and the data comes from the UCI Machine Learning Repository. Here's the link.

I followed the Scrapy tutorial and changed the code in the "quotes" spider as follows. After running "scrapy crawl quotes" in the terminal, I used "scrapy crawl quotes -o quotes.json" to save all the titles as JSON.

The counts do not match by 158: I have 39,486 URLs but 39,644 website titles. In addition, the order of the titles does not correspond to the order of the URLs; for example, the final title corresponds to the third-to-last URL. Could you please help me identify the problems?

Here's the result:

I tried using Beautiful Soup in a Jupyter Notebook, but it was slow and I couldn't tell whether the code was still running.

import scrapy
import pandas as pd


# Load the URL column from the Online News Popularity CSV
df = pd.read_csv("/Users/.../OnlineNewsPopularity.csv", delim_whitespace=False)
url = df['url']

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = url.values.tolist()

    def parse(self, response):
        # Yield one item per <h1 class="title"> element on the page
        for quote in response.css('h1.title'):
            yield {
                'Title': quote.css('h1.title::text').extract_first(),
            }
1 Answer


If your aim is only to keep the correspondence between URL and title, you can add the URL to your scraped item:

def parse(self, response):
    for quote in response.css('h1.title'):
        yield {
            'Title': quote.css('h1.title::text').extract_first(),
            'url': response.url,  # keep the source URL with each title
        }

Alternatively, if you want to process the URLs in order, there are several approaches, all a bit more complex. The most common idea is to override start_requests so that it requests only the first URL; then, in parse, request the second URL with the same method (parse) as the callback, and so on.

See Sequential scraping from multiple start_urls leading to error in parsing and Scrapy Crawl URLs in Order

  • That would do the job, many thanks! After crawling the website titles, I just need to merge them together. – Chi Apr 01 '19 at 19:08