I am using Scrapy to crawl and scrape StackOverflow.com. This is so.py:

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com']

    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'link': response.url,
        }

Expected result: so.json (valid JSON format)

[
   "http://stackoverflow.com/questions/36421917/exponential-number-in-custom-number-format-of-excel",
   "http://stackoverflow.com/questions/36421343/can-not-install-requirements-txt",
   "http://stackoverflow.com/questions/36418815/difference-between-two-approaches-to-pass-parameters-to-web-server",
   "http://stackoverflow.com/questions/36421743/sharing-an-oracle-database-connection-between-simultaneous-celery-tasks",
   "http://stackoverflow.com/questions/36421941/jquery-add-css-style"
]

I run:

scrapy runspider so.py -o so.json

The result does not match the expected output above. I am stuck here.

Vy Do

1 Answer

Try using the FEED_FORMAT=jsonlines setting:

scrapy runspider so.py -o so.json --set FEED_FORMAT=jsonlines
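
With this setting each item is written as its own one-line JSON object, so the file as a whole is not a single JSON document and has to be parsed line by line. A minimal sketch of reading such output, assuming each record has the single 'link' key yielded by parse_question (the read_links helper and the sample text are made up for illustration):

```python
import json


def read_links(jsonlines_text):
    """Collect the 'link' value from each non-empty JSON Lines record."""
    links = []
    for line in jsonlines_text.splitlines():
        line = line.strip()
        if line:  # skip blank lines between records
            links.append(json.loads(line)["link"])
    return links


# Sample records shaped like the spider's items (URLs taken from the question).
sample = (
    '{"link": "http://stackoverflow.com/questions/36421941/jquery-add-css-style"}\n'
    '{"link": "http://stackoverflow.com/questions/36421343/can-not-install-requirements-txt"}\n'
)

links = read_links(sample)
# Serialising the list yields a plain JSON array of strings.
print(json.dumps(links, indent=3))
```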

If you want to get

[
   "https://stackoverflow.com/questions/36421917/exponential-number-in-custom-number-format-of-excel",
   "https://stackoverflow.com/questions/36421343/can-not-install-requirements-txt",
   "https://stackoverflow.com/questions/36418815/difference-between-two-approaches-to-pass-parameters-to-web-server",
   "https://stackoverflow.com/questions/36421743/sharing-an-oracle-database-connection-between-simultaneous-celery-tasks",
   "https://stackoverflow.com/questions/36421941/jquery-add-css-style"
]

you should write your own ItemExporter; see this question.

Danil
  • This is the result after running the above command: https://gist.github.com/donhuvy/7f75e0cf30ab0fe2ba79069ffa328b31 It still does not match my expected result. – Vy Do Apr 05 '16 at 10:09
  • After applying your revised answer, I ran the command and got this result: https://gist.github.com/donhuvy/cc21a2a99b64fa367dbaec70f27b564c . It isn't the expected result. Please help me solve the problem! – Vy Do Apr 06 '16 at 01:39