
I have this folder structure:

app.py                        # Flask app
app/
    datafoo/
        scrapy.cfg
        crawler.py
        blogs/
            pipelines.py
            settings.py
            middlewares.py
            items.py
            spiders/
                allmusic_feed.py
                allmusic_data/
                    delicate_tracks.jl

scrapy.cfg:

[settings]
default = blogs.settings

allmusic_feed.py:

import scrapy

from blogs.items import AllMusicItem


class AllMusicDelicateTracks(scrapy.Spider):  # one amongst many spiders
    name = "allmusic_delicate_tracks"
    allowed_domains = ["allmusic.com"]
    start_urls = [
        "http://web.archive.org/web/20160813101056/http://www.allmusic.com/mood/delicate-xa0000000972/songs",
    ]

    def parse(self, response):
        for sel in response.xpath('//tr'):
            item = AllMusicItem()
            item['artist'] = sel.xpath('.//td[@class="performer"]/a/text()').extract_first()
            item['track'] = sel.xpath('.//td[@class="title"]/a/text()').extract_first()
            yield item

crawler.py:

import json

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def blog_crawler(mood):
    spider_name, jl = mood  # spider name and the .jl file where crawled data is stored
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_name, domain='allmusic.com')
    process.start()

    allmusic = []
    allmusic_tracks = []
    allmusic_artists = []
    try:
        with open(jl, 'r') as t:
            for line in t:
                allmusic.append(json.loads(line))
    except Exception as e:
        print(e, 'try another mood')

    for item in allmusic:
        allmusic_artists.append(item['artist'])
        allmusic_tracks.append(item['track'])
    return zip(allmusic_tracks, allmusic_artists)

app.py:

@app.route('/tracks', methods=['GET', 'POST'])
def tracks():
    from app.datafoo import crawler

    mood = ['allmusic_delicate_tracks', 'blogs/spiders/allmusic_data/delicate_tracks.jl']
    results = crawler.blog_crawler(mood)
    return results

If I simply run the app with python app.py, I get the following error:

ValueError: signal only works in main thread

When I run the app with gunicorn -c gconfig.py app:app --log-level=debug --threads 2, it just hangs there:

127.0.0.1 - - [29/Jan/2018:03:40:36 -0200] "GET /tracks HTTP/1.1" 500 291 "http://127.0.0.1:8080/menu" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

Lastly, running with gunicorn -c gconfig.py app:app --log-level=debug --threads 2 --error-logfile server.log, I get:

server.log

[2018-01-30 13:41:39 -0200] [4580] [DEBUG] Current configuration:
  proxy_protocol: False
  worker_connections: 1000
  statsd_host: None
  max_requests_jitter: 0
  post_fork: <function post_fork at 0x1027da848>
  errorlog: server.log
  enable_stdio_inheritance: False
  worker_class: sync
  ssl_version: 2
  suppress_ragged_eofs: True
  syslog: False
  syslog_facility: user
  when_ready: <function when_ready at 0x1027da9b0>
  pre_fork: <function pre_fork at 0x1027da938>
  cert_reqs: 0
  preload_app: False
  keepalive: 5
  accesslog: -
  group: 20
  graceful_timeout: 30
  do_handshake_on_connect: False
  spew: False
  workers: 16
  proc_name: None
  sendfile: None
  pidfile: None
  umask: 0
  on_reload: <function on_reload at 0x10285c2a8>
  pre_exec: <function pre_exec at 0x1027da8c0>
  worker_tmp_dir: None
  limit_request_fields: 100
  pythonpath: None
  on_exit: <function on_exit at 0x102861500>
  config: gconfig.py
  logconfig: None
  check_config: False
  statsd_prefix: 
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  reload_engine: auto
  proxy_allow_ips: ['127.0.0.1']
  pre_request: <function pre_request at 0x10285cde8>
  post_request: <function post_request at 0x10285ced8>
  forwarded_allow_ips: ['127.0.0.1']
  worker_int: <function worker_int at 0x1027daa28>
  raw_paste_global_conf: []
  threads: 2
  max_requests: 0
  chdir: /Users/me/Documents/Code/Apps/app
  daemon: False
  user: 501
  limit_request_line: 4094
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  certfile: None
  on_starting: <function on_starting at 0x10285c140>
  post_worker_init: <function post_worker_init at 0x10285c848>
  child_exit: <function child_exit at 0x1028610c8>
  worker_exit: <function worker_exit at 0x102861230>
  paste: None
  default_proc_name: app:app
  syslog_addr: unix:///var/run/syslog
  syslog_prefix: None
  ciphers: TLSv1
  worker_abort: <function worker_abort at 0x1027daaa0>
  loglevel: debug
  bind: ['127.0.0.1:8080']
  raw_env: []
  initgroups: False
  capture_output: False
  reload: False
  limit_request_field_size: 8190
  nworkers_changed: <function nworkers_changed at 0x102861398>
  timeout: 120
  keyfile: None
  ca_certs: None
  tmp_upload_dir: None
  backlog: 2048
  logger_class: gunicorn.glogging.Logger
[2018-01-30 13:41:39 -0200] [4580] [INFO] Starting gunicorn 19.7.1
[2018-01-30 13:41:39 -0200] [4580] [DEBUG] Arbiter booted
[2018-01-30 13:41:39 -0200] [4580] [INFO] Listening at: http://127.0.0.1:8080 (4580)
[2018-01-30 13:41:39 -0200] [4580] [INFO] Using worker: threads
[2018-01-30 13:41:39 -0200] [4580] [INFO] Server is ready. Spawning workers
[2018-01-30 13:41:39 -0200] [4583] [INFO] Booting worker with pid: 4583
[2018-01-30 13:41:39 -0200] [4583] [INFO] Worker spawned (pid: 4583)
[2018-01-30 13:41:39 -0200] [4584] [INFO] Booting worker with pid: 4584
[2018-01-30 13:41:39 -0200] [4584] [INFO] Worker spawned (pid: 4584)
[2018-01-30 13:41:39 -0200] [4585] [INFO] Booting worker with pid: 4585
[2018-01-30 13:41:39 -0200] [4585] [INFO] Worker spawned (pid: 4585)
[2018-01-30 13:41:40 -0200] [4586] [INFO] Booting worker with pid: 4586
[2018-01-30 13:41:40 -0200] [4586] [INFO] Worker spawned (pid: 4586)
[2018-01-30 13:41:40 -0200] [4587] [INFO] Booting worker with pid: 4587
[2018-01-30 13:41:40 -0200] [4587] [INFO] Worker spawned (pid: 4587)
[2018-01-30 13:41:40 -0200] [4588] [INFO] Booting worker with pid: 4588
[2018-01-30 13:41:40 -0200] [4588] [INFO] Worker spawned (pid: 4588)
[2018-01-30 13:41:40 -0200] [4589] [INFO] Booting worker with pid: 4589
[2018-01-30 13:41:40 -0200] [4589] [INFO] Worker spawned (pid: 4589)
[2018-01-30 13:41:40 -0200] [4590] [INFO] Booting worker with pid: 4590
[2018-01-30 13:41:40 -0200] [4590] [INFO] Worker spawned (pid: 4590)
[2018-01-30 13:41:40 -0200] [4591] [INFO] Booting worker with pid: 4591
[2018-01-30 13:41:40 -0200] [4591] [INFO] Worker spawned (pid: 4591)
[2018-01-30 13:41:40 -0200] [4592] [INFO] Booting worker with pid: 4592
[2018-01-30 13:41:40 -0200] [4592] [INFO] Worker spawned (pid: 4592)
[2018-01-30 13:41:40 -0200] [4595] [INFO] Booting worker with pid: 4595
[2018-01-30 13:41:40 -0200] [4595] [INFO] Worker spawned (pid: 4595)
[2018-01-30 13:41:40 -0200] [4596] [INFO] Booting worker with pid: 4596
[2018-01-30 13:41:40 -0200] [4596] [INFO] Worker spawned (pid: 4596)
[2018-01-30 13:41:40 -0200] [4597] [INFO] Booting worker with pid: 4597
[2018-01-30 13:41:40 -0200] [4597] [INFO] Worker spawned (pid: 4597)
[2018-01-30 13:41:40 -0200] [4598] [INFO] Booting worker with pid: 4598
[2018-01-30 13:41:40 -0200] [4598] [INFO] Worker spawned (pid: 4598)
[2018-01-30 13:41:40 -0200] [4599] [INFO] Booting worker with pid: 4599
[2018-01-30 13:41:40 -0200] [4599] [INFO] Worker spawned (pid: 4599)
[2018-01-30 13:41:40 -0200] [4600] [INFO] Booting worker with pid: 4600
[2018-01-30 13:41:40 -0200] [4600] [INFO] Worker spawned (pid: 4600)
[2018-01-30 13:41:40 -0200] [4580] [DEBUG] 16 workers
[2018-01-30 13:41:47 -0200] [4583] [DEBUG] GET /menu
[2018-01-30 13:41:54 -0200] [4584] [DEBUG] GET /tracks

NOTE:

In this SO answer I learned that in order to integrate Flask and Scrapy you can use one of:

1. Python subprocess

2. Twisted-Klein + Scrapy

3. ScrapyRT

But I haven't had any luck adapting my specific code to any of these solutions.

I reckon a subprocess would be the simplest approach and would suffice, since scraping is only rarely triggered from the user-facing flow, but I'm not sure. Something like the sketch below is roughly what I have in mind.
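
Untested sketch of the subprocess idea (the helper name and output path are placeholders, and I haven't verified that this really sidesteps the reactor/signal problem):

import json
import subprocess


def run_spider_subprocess(spider_name, jl_path):
    # Run the spider through the scrapy CLI in a separate process, so that
    # Twisted's reactor and its signal handlers stay out of the Flask worker.
    # jl_path should be absolute: -o is resolved against cwd below, while the
    # open() call is resolved against the Flask process's cwd, and Scrapy
    # appends to the output file if it already exists.
    subprocess.check_call(
        ['scrapy', 'crawl', spider_name, '-o', jl_path],
        cwd='app/datafoo',  # the directory that contains scrapy.cfg
    )
    with open(jl_path) as f:
        return [json.loads(line) for line in f]

The view would then just call something like run_spider_subprocess('allmusic_delicate_tracks', '/tmp/delicate_tracks.jl') and format the result.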

Could anyone please point me in the right direction here?

  • I think you might be unable to create a user-specified worker process from within Gunicorn. However, I am not sure, as I'm more used to traditional Nginx/WSGI setups. What is CrawlerProcess in your code up there? Is it somehow provided as a template with Gunicorn? – rostamn739 Jan 27 '18 at 20:14
  • I've edited with the CrawlerProcess import. Is that what you mean? –  Jan 27 '18 at 20:18
  • As for the 500 error, you need to print out the specific error – ospider Jan 29 '18 at 05:39
  • @ospider I've edited. That's all I got. –  Jan 29 '18 at 05:43
  • I mean you have to modify your program to log the specific error – ospider Jan 29 '18 at 06:00
  • I'm afraid I don't know how to do that. At gunicorn.py? If I run with $ python app.py, I get a signal error telling me the spider must be on the main thread, which is the reason I've used the flag --threads 2 –  Jan 29 '18 at 06:06
  • please refer to edit. –  Jan 29 '18 at 06:13
  • use ```gunicorn -c gconfig.py app:app --log-level=debug --threads 2 --error-logfile somewhere.log``` and show us 'somewhere.log' result, then we can help you with the 500 error. – Sepehr Hamzehlooy Jan 30 '18 at 10:18

1 Answer


Here's a minimal example of how you can do it with ScrapyRT.

This is the project structure:

project/
├── scraping
│   ├── example
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── __init__.py
│   │       └── quotes.py
│   └── scrapy.cfg
└── webapp
    └── example.py

The scraping directory contains the Scrapy project. This project contains one spider, quotes.py, which scrapes some quotes from quotes.toscrape.com:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'text': quote.xpath('normalize-space(./span[@class="text"])').extract_first()
            }

To start ScrapyRT and have it listen for scraping requests, go to the Scrapy project's directory scraping and issue the scrapyrt command:

$ cd ./project/scraping
$ scrapyrt

ScrapyRT will now listen on localhost:9080.
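
Before wiring this into Flask, you can sanity-check the endpoint directly, for example with a small requests snippet that mirrors the call the Flask app below makes (assuming ScrapyRT is running on the default port):

import requests

# Ask ScrapyRT to run the 'quotes' spider from its start_requests and
# return the scraped items as JSON.
response = requests.get(
    'http://localhost:9080/crawl.json',
    params={'spider_name': 'quotes', 'start_requests': True},
)
data = response.json()
print(list(data), len(data.get('items', [])))

If 'items' is missing or empty here, the problem is on the Scrapy/ScrapyRT side rather than in the Flask app.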

The webapp directory contains a simple Flask app that scrapes quotes on demand (using the spider above) and displays them to the user:

from __future__ import unicode_literals

import json
import requests

from flask import Flask

app = Flask(__name__)

@app.route('/')
def show_quotes():
    params = {
        'spider_name': 'quotes',
        'start_requests': True
    }
    response = requests.get('http://localhost:9080/crawl.json', params)
    data = json.loads(response.text)
    result = '\n'.join('<p><b>{}</b> - {}</p>'.format(item['author'], item['text'])
                       for item in data['items'])
    return result

To start the app:

$ cd ./project/webapp
$ FLASK_APP=example.py flask run

Now when you point your browser at localhost:5000, you'll see the list of quotes freshly scraped from quotes.toscrape.com.

Tomáš Linhart
  • Thanks for your answer, very useful. I've followed your example, but I am getting `for item in data['items']) KeyError: u'items'`. Also, it would be more useful if your answer contemplated a working example based on my scraping code, because it works outside Flask. –  Jan 31 '18 at 02:50
  • I provided such a minimal example intentionally so that all the important stuff is clear. And IMHO the folder structure is not that far from yours, or can be very easily adapted. As for the error you are getting, try to import `pprint` and display the content of `data` with `pprint.pprint(data)` in `show_quotes`. It seems there's either a problem communicating with ScrapyRT or the crawler didn't return any items. – Tomáš Linhart Jan 31 '18 at 06:41
  • When I run your example on my spiders, I'm getting `TypeError: 'dict' object is not callable` –  Feb 01 '18 at 04:30
  • And when I change to pprint data, I get `)` –  Feb 01 '18 at 04:52
  • `2018-02-01 02:47:58-0200 [-] "127.0.0.1" - - [01/Feb/2018:04:47:57 +0000] "GET /crawl.json?spider_name=allmusic_warm_tracks&start_requests=True HTTP/1.1" 500 48 "-" "python-requests/2.18.4"` –  Feb 01 '18 at 05:00
  • Where do you get the error `TypeError: 'dict' object is not callable`? There are multiple components involved so you have to be explicit. I'd guess the problem is in the spider itself, as from the rest of your comments, it seems that Flask and ScrapyRT play well together. – Tomáš Linhart Feb 01 '18 at 06:45
  • When I go to ```http://localhost:5000``` it is empty for me, and I have followed your steps. How do I get output here? – Emil11 Jul 01 '22 at 17:06