
I am making an API which returns a JsonResponse containing the text scraped by Scrapy. When I run the scripts individually they run perfectly, but when I try to integrate the Scrapy script with Django I do not get any output.

What I want is to simply return the response to the request (which in my case is a Postman POST request).

Here is the code I am trying:

from django.http import HttpResponse, JsonResponse
from django.views.decorators.csrf import csrf_exempt
import scrapy
from scrapy.crawler import CrawlerProcess


@csrf_exempt
def some_view(request, username):
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'LOG_ENABLED': 'false'
    })
    process_test = process.crawl(QuotesSpider)
    process.start()

    return JsonResponse({'return': process_test})


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/random',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        return response.css('.text::text').extract_first()

I am very new to Python and Django. Any kind of help would be much appreciated.

fat potato
  • I don't use those libraries but hopefully I can help on the more "general" Python. 1- What does `process_test` do in `process_test = process.crawl(QuotesSpider)`? In Python it's ok to not assign a return value to anything. 2- I'm tempted to say try with an instance of the class, so like this: `process.crawl(QuotesSpider())`. – Guimoute Oct 17 '18 at 12:47
  • @Guimoute `process_test` is supposed to be the JSON response to my request, and making it an instance did not help much. – fat potato Oct 17 '18 at 12:54
  • Sorry I'm dumb, I literally skipped a line while reading... – Guimoute Oct 17 '18 at 14:03
  • no problem glad you try to help – fat potato Oct 17 '18 at 14:43

1 Answer


In your code, `process_test` is the `Deferred` returned by `process.crawl()`, not the output of the crawling.

You need additional configuration to make your spider store its output "somewhere". See this SO Q&A about writing a custom pipeline.
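One common pattern (a sketch of my own, not part of the linked Q&A; the `ItemCollectorPipeline` name and the in-memory list are illustrative) is a pipeline that collects every scraped item into a list you can read after `process.start()` returns. Note that for a pipeline to see anything, the spider's `parse` must yield items (e.g. a dict), not return a bare string:

```python
class ItemCollectorPipeline:
    """Collects every scraped item into a class-level list."""

    items = []

    def process_item(self, item, spider):
        # Scrapy calls this once per item the spider yields.
        ItemCollectorPipeline.items.append(item)
        return item


# Enable it in the CrawlerProcess settings (path is hypothetical):
#     process = CrawlerProcess({
#         'ITEM_PIPELINES': {'myapp.pipelines.ItemCollectorPipeline': 100},
#     })
# and in the spider, yield a dict instead of returning a string:
#     yield {'text': response.css('.text::text').extract_first()}
```

After `process.start()` finishes, `ItemCollectorPipeline.items` holds the scraped data for the view to return.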

If you just want to synchronously retrieve and parse a single page, you may be better off using requests to retrieve the page, and parsel to parse it.

Apalala