
I have written a Scrapy spider that I'm running inside a Django Celery task. When I run the task using the command python manage.py celery worker --loglevel=info from this tutorial, the task starts in the terminal and the Scrapy log begins to appear, but soon after the log starts coming up on the screen the Celery output seems to take over the terminal window. I'm still new to Celery, so I can't tell what is happening to the task. Here is the code for the tasks.py script and the spider file (with code I got from an SO post).

tasks.py

from celery.registry import tasks
from celery.task import Task


from django.template.loader import render_to_string
from django.utils.html import strip_tags

from django.core.mail import EmailMultiAlternatives


from ticket_city_scraper.ticket_city_scraper.spiders.tc_spider import spiderCrawl
from celery import shared_task

@shared_task
def crawl():
    return spiderCrawl()
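
From what I understand, the worker command only starts a worker process that sits and waits for jobs, so the task itself still has to be queued from somewhere else. Here is a minimal sketch of how the task could be queued (for example from manage.py shell or from a view), using the comparison.tasks.crawl name that shows up in the worker log further down:

# Sketch only: queue the crawl task; the worker started with
# "python manage.py celery worker" then picks it up and runs it.
from comparison.tasks import crawl

result = crawl.delay()   # returns an AsyncResult immediately
print result.id          # task id; result.get() would block until the task finishes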

spider file (with relevant code at the bottom)

import scrapy
import re
import json
from scrapy.crawler import CrawlerProcess
from scrapy import Request
from scrapy.contrib.spiders import CrawlSpider , Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
from comparison.ticket_city_scraper.ticket_city_scraper.items import ComparatorItem
from urlparse import urljoin

from scrapy.utils.project import get_project_settings
from scrapy.crawler import Crawler, CrawlerRunner
from scrapy import signals
from twisted.internet import reactor, defer
from scrapy.utils.log import configure_logging


from billiard import Process




bandname = raw_input("Enter bandname\n")
tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html" 

class MySpider3(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.ticketcity.com"]

    start_urls = [tc_url]
    tickets_list_xpath = './/div[@class = "vevent"]'
    def create_link(self, bandname):
        tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"  
        self.start_urls = [tc_url]
        #return tc_url      

    tickets_list_xpath = './/div[@class = "vevent"]'

    def parse_json(self, response):
        loader = response.meta['loader']
        jsonresponse = json.loads(response.body_as_unicode())
        ticket_info = jsonresponse.get('B')
        price_list = [i.get('P') for i in ticket_info]
        if len(price_list) > 0:
            str_Price = str(price_list[0])
            ticketPrice = unicode(str_Price, "utf-8")
            loader.add_value('ticketPrice', ticketPrice)
        else:
            ticketPrice = unicode("sold out", "utf-8")
            loader.add_value('ticketPrice', ticketPrice)
        return loader.load_item()

    def parse_price(self, response):
        print "parse price function entered \n"
        loader = response.meta['loader']
        event_City = response.xpath('.//span[@itemprop="addressLocality"]/text()').extract() 
        eventCity = ''.join(event_City) 
        loader.add_value('eventCity' , eventCity)
        event_State = response.xpath('.//span[@itemprop="addressRegion"]/text()').extract() 
        eventState = ''.join(event_State) 
        loader.add_value('eventState' , eventState) 
        event_Date = response.xpath('.//span[@class="event_datetime"]/text()').extract() 
        eventDate = ''.join(event_Date)  
        loader.add_value('eventDate' , eventDate)    
        ticketsLink = loader.get_output_value("ticketsLink")
        json_id_list= re.findall(r"(\d+)[^-]*$", ticketsLink)
        json_id=  "".join(json_id_list)
        json_url = "https://www.ticketcity.com/Catalog/public/v1/events/" + json_id + "/ticketblocks?P=0,99999999&q=0&per_page=250&page=1&sort=p.asc&f.t=s&_=1436642392938"
        yield scrapy.Request(json_url, meta={'loader': loader}, callback = self.parse_json, dont_filter = True) 

    def parse(self, response):
        """
        # """
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):
            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader
            loader.add_xpath('eventName' , './/span[@class="summary listingEventName"]/text()')
            loader.add_xpath('eventLocation' , './/div[@class="divVenue location"]/text()')
            loader.add_xpath('ticketsLink' , './/a[@class="divEventDetails url"]/@href')
            #loader.add_xpath('eventDateTime' , '//div[@id="divEventDate"]/@title') #datetime type
            #loader.add_xpath('eventTime' , './/*[@class = "productionsTime"]/text()')

            print "Here is ticket link \n" + loader.get_output_value("ticketsLink")
            #sel.xpath("//span[@id='PractitionerDetails1_Label4']/text()").extract()
            ticketsURL = "https://www.ticketcity.com/" + loader.get_output_value("ticketsLink")
            ticketsURL = urljoin(response.url, ticketsURL)
            yield scrapy.Request(ticketsURL, meta={'loader': loader}, callback = self.parse_price, dont_filter = True)

#Code to run spider from celery task script
class UrlCrawlerScript(Process):
    def __init__(self, spider):
        Process.__init__(self)
        settings = get_project_settings()
        self.crawler = Crawler(settings)
        self.crawler.configure()
        self.crawler.signals.connect(reactor.stop, signal = signals.spider_closed)
        self.spider = spider

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        reactor.run()

def spiderCrawl():
   # settings = get_project_settings()
   # settings.set('USER_AGENT','Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')
   # process = CrawlerProcess(settings)
   # process.crawl(MySpider3)
   # process.start()
   spider = MySpider3()
   crawler = UrlCrawlerScript(spider)
   crawler.start()
   crawler.join()

I'm trying to make it so that the user can enter text into a form, which will then be concatenated to a URL, but for now I'm using raw_input to get the user's input. Is there something that needs to be added to the code in order for the task to run completely? Any help/code would be appreciated, thanks.
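
Roughly the direction I want to go in is sketched below (this is not the code I'm running yet, and the bandname argument on the task and on the spider is something I would still have to add): the band name comes in as a task argument instead of the module-level raw_input, and gets passed through to the spider, which would build its start_urls from it.

# Rough sketch of the intended flow, not the current code. Scrapy forwards
# extra keyword arguments from process.crawl() to the spider, so the spider
# could build its start_urls from self.bandname (e.g. in start_requests)
# instead of calling raw_input at import time.
from celery import shared_task
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# import path mirrors the one already used in tasks.py above
from ticket_city_scraper.ticket_city_scraper.spiders.tc_spider import MySpider3

@shared_task
def crawl(bandname):
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(MySpider3, bandname=bandname)
    process.start()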

EDIT:

Terminal window after running the command:

(trydjango18)elijah@elijah-VirtualBox:~/Desktop/trydjango18/src2/trydjango18$ python manage.py celery worker --loglevel=info
/home/elijah/Desktop/trydjango18/trydjango18/local/lib/python2.7/site-packages/django/core/management/base.py:259: RemovedInDjango19Warning: "requires_model_validation" is deprecated in favor of "requires_system_checks".
  RemovedInDjango19Warning)

/home/elijah/Desktop/trydjango18/trydjango18/local/lib/python2.7/site-packages/celery/app/defaults.py:251: CPendingDeprecationWarning: 
    The 'BROKER_VHOST' setting is scheduled for deprecation in     version 2.5 and removal in version v4.0.     Use the BROKER_URL setting instead

  alternative='Use the {0.alt} instead'.format(opt))

/home/elijah/Desktop/trydjango18/trydjango18/local/lib/python2.7/site-packages/celery/app/defaults.py:251: CPendingDeprecationWarning: 
    The 'BROKER_HOST' setting is scheduled for deprecation in     version 2.5 and removal in version v4.0.     Use the BROKER_URL setting instead

  alternative='Use the {0.alt} instead'.format(opt))

/home/elijah/Desktop/trydjango18/trydjango18/local/lib/python2.7/site-packages/celery/app/defaults.py:251: CPendingDeprecationWarning: 
    The 'BROKER_USER' setting is scheduled for deprecation in     version 2.5 and removal in version v4.0.     Use the BROKER_URL setting instead

  alternative='Use the {0.alt} instead'.format(opt))

/home/elijah/Desktop/trydjango18/trydjango18/local/lib/python2.7/site-packages/celery/app/defaults.py:251: CPendingDeprecationWarning: 
    The 'BROKER_PASSWORD' setting is scheduled for deprecation in     version 2.5 and removal in version v4.0.     Use the BROKER_URL setting instead

  alternative='Use the {0.alt} instead'.format(opt))

/home/elijah/Desktop/trydjango18/trydjango18/local/lib/python2.7/site-packages/celery/app/defaults.py:251: CPendingDeprecationWarning: 
    The 'BROKER_PORT' setting is scheduled for deprecation in     version 2.5 and removal in version v4.0.     Use the BROKER_URL setting instead

  alternative='Use the {0.alt} instead'.format(opt))

/home/elijah/Desktop/trydjango18/src2/trydjango18/comparison/ticket_city_scraper/ticket_city_scraper/spiders/tc_spider.py:6: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
  from scrapy.contrib.spiders import CrawlSpider , Rule

/home/elijah/Desktop/trydjango18/src2/trydjango18/comparison/ticket_city_scraper/ticket_city_scraper/spiders/tc_spider.py:9: ScrapyDeprecationWarning: Module `scrapy.contrib.loader` is deprecated, use `scrapy.loader` instead
  from scrapy.contrib.loader import ItemLoader

/home/elijah/Desktop/trydjango18/src2/trydjango18/comparison/ticket_city_scraper/ticket_city_scraper/spiders/tc_spider.py:11: ScrapyDeprecationWarning: Module `scrapy.contrib.loader.processor` is deprecated, use `scrapy.loader.processors` instead
  from scrapy.contrib.loader.processor import Join, MapCompose

Enter bandname
awolnation
/home/elijah/Desktop/trydjango18/trydjango18/local/lib/python2.7/site-packages/celery/apps/worker.py:161: CDeprecationWarning: 
Starting from version 3.2 Celery will refuse to accept pickle by default.

The pickle serializer is a security concern as it may give attackers
the ability to execute any command.  It's important to secure
your broker from unauthorized access when using pickle, so we think
that enabling pickle should require a deliberate action and not be
the default choice.

If you depend on pickle then you should set a setting to disable this
warning and to be sure that everything will continue working
when you upgrade to Celery 3.2::

    CELERY_ACCEPT_CONTENT = ['pickle', 'json', 'msgpack', 'yaml']

You must only enable the serializers that you will actually use.


  warnings.warn(CDeprecationWarning(W_PICKLE_DEPRECATED))

[2015-08-05 18:15:22,915: WARNING/MainProcess] /home/elijah/Desktop/trydjango18/trydjango18/local/lib/python2.7/site-packages/celery/apps/worker.py:161: CDeprecationWarning: 
Starting from version 3.2 Celery will refuse to accept pickle by default.

The pickle serializer is a security concern as it may give attackers
the ability to execute any command.  It's important to secure
your broker from unauthorized access when using pickle, so we think
that enabling pickle should require a deliberate action and not be
the default choice.

If you depend on pickle then you should set a setting to disable this
warning and to be sure that everything will continue working
when you upgrade to Celery 3.2::

    CELERY_ACCEPT_CONTENT = ['pickle', 'json', 'msgpack', 'yaml']

You must only enable the serializers that you will actually use.


  warnings.warn(CDeprecationWarning(W_PICKLE_DEPRECATED))


 -------------- celery@elijah-VirtualBox v3.1.18 (Cipater)
---- **** ----- 
--- * ***  * -- Linux-3.13.0-54-generic-x86_64-with-Ubuntu-14.04-trusty
-- * - **** --- 
- ** ---------- [config]
- ** ---------- .> app:         default:0x7f6ce3b3e410 (djcelery.loaders.DjangoLoader)
- ** ---------- .> transport:   amqp://guest:**@localhost:5672//
- ** ---------- .> results:     database
- *** --- * --- .> concurrency: 2 (prefork)
-- ******* ---- 
--- ***** ----- [queues]
 -------------- .> celery           exchange=celery(direct) key=celery


[tasks]
  . comparison.tasks.crawl

[2015-08-05 18:15:23,178: INFO/MainProcess] Connected to amqp://guest:**@127.0.0.1:5672//
[2015-08-05 18:15:23,276: INFO/MainProcess] mingle: searching for neighbors
[2015-08-05 18:15:24,322: INFO/MainProcess] mingle: all alone
/home/elijah/Desktop/trydjango18/trydjango18/local/lib/python2.7/site-packages/djcelery/loaders.py:136: UserWarning: Using settings.DEBUG leads to a memory leak, never use this setting in production environments!
  warn('Using settings.DEBUG leads to a memory leak, never '

[2015-08-05 18:15:24,403: WARNING/MainProcess] /home/elijah/Desktop/trydjango18/trydjango18/local/lib/python2.7/site-packages/djcelery/loaders.py:136: UserWarning: Using settings.DEBUG leads to a memory leak, never use this setting in production environments!
  warn('Using settings.DEBUG leads to a memory leak, never '

[2015-08-05 18:15:24,404: WARNING/MainProcess] celery@elijah-VirtualBox ready.
  • sorry - this is quite unclear - what exactly is the problem you are seeing? you say "it seems that the celery script takes over the terminal window" - do you mean that you cannot enter text in the terminal? this is expected behaviour - you are running the celery worker in the foreground. it seems that you want to enter text interactively into some process - you will not be able to do this with the celery worker. – scytale Aug 05 '15 at 12:12
  • you also seem to be doing your own subprocess management using `billiard.Process` - why are you doing that instead of just using celery subtasks? – scytale Aug 05 '15 at 12:13
  • @scytale The only problem was that I wasn't able to tell if the spider was actually executing the rest of the code after starting. I just made an update to the post showing the text in the terminal window. My spider is set up to take a band name from the user and concatenate that to the start_url. The raw_input command is executing, as can be seen from the terminal, but I can't see anything after that so I don't know if the response is being pipelined into the database. Also, I'm trying to pass the bandname as a scrapy argument, but I don't know how to do this with a celery task. – loremIpsum1771 Aug 05 '15 at 18:27
  • please try not to dump such large amounts of text into your question. In this case the log output shows that celery is running and waiting for tasks. – scytale Aug 06 '15 at 09:23
  • it's not clear what is going on in your code. why are you using the django command to manage celery? I would suggest starting small - follow the tutorial in the celery docs and use that as a starting point to write a small tasks that crawls a single page and writes data to a file - don't use a web interface or anything like that. once you have that task working it should be easier to integrate it with your web interface. – scytale Aug 06 '15 at 09:25
  • @scytale I have been able to run a script using celery and even this code seems to be working, it's just that I'm trying to figure out why I'm not seeing the normal execution of the spider. But as for the code, I was using the solution from this [SO post](http://stackoverflow.com/questions/22116493/run-a-scrapy-spider-in-a-celery-task?lq=1) so you can take a look at it – loremIpsum1771 Aug 08 '15 at 19:13
  • no I can't take a look at that. please try to explain better what the problem is. your question does not make it clear, and you have just said that your code seems to be working - what exactly is the normal execution you are not seeing. no one can help you unless you are more clear – scytale Aug 10 '15 at 00:18
  • @scytale Whenever you run a Scrapy spider inside the terminal, there is a scrapy log that is displayed as the spider crawls each of the specified domains. However, when I try to run it using the celery task, only the initial part of the log is shown, but none of the actual web page crawling is displayed. So the problem is that not only does it seem that the spider code is not fully executing, but when I check the database table that the spider is supposed to pipeline the response into, it is empty. Have you ever run a scrapy spider in a celery task? If so, what does the log usually look like? – loremIpsum1771 Aug 13 '15 at 20:18
  • where is this output going to? stdout? stderr? logging? have you configured logging correctly? – scytale Aug 14 '15 at 09:21
  • have you configured celery to propagate exceptions? if not then if there is an exception in the crawler you may not see it. have you configured the results backend correctly? if no then the same applies. have you tried checking the `traceback` property of your task results? – scytale Aug 14 '15 at 09:25
  • @scytale The spider works perfectly when I run the command *scrapy crawl * in the terminal when inside of the root directory of the spider. When I have been trying to run the celery task in the django project, I was using the command *python manage.py celery worker --loglevel=info*. I think that this command might only be showing that the task is queued to run, but doesn't actually run it. What is the proper way to run celery tasks? I'm still very new to celery and the only way I have seen them being run in django is within the actual website and not from the terminal. – loremIpsum1771 Aug 14 '15 at 22:01
  • @scytale Ideally what I am trying to do is to have a form on a webpage that takes user input which will be passed to scrapy spiders and concatenated to a url. I was assuming I could use the event of a submit button being clicked to start the celery process of passing the form data to the spider and executing the spider, but I wasn't sure exactly how to go about this (a rough sketch of this flow is included after these comments). – loremIpsum1771 Aug 14 '15 at 22:05
  • yes, all this is possible. please ask new questions but about 1 thing at a time - this question is kind of about everything. – scytale Aug 17 '15 at 08:59
  • @scytale I just posted a new question – loremIpsum1771 Aug 17 '15 at 16:42
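
For context, here is a rough sketch of the form-driven flow discussed in the comments above (the view name and the bandname form field are made up, and it assumes the crawl task has been changed to accept a bandname argument): the submit handler only queues the task and returns, and the Celery worker does the crawling in the background.

# Illustrative only: a Django view that queues the crawl when the form is
# submitted. The view name and the 'bandname' form field are assumptions,
# and crawl() would need to accept a bandname argument as sketched earlier.
from django.http import HttpResponse
from comparison.tasks import crawl

def start_crawl(request):
    bandname = request.POST.get('bandname', '')
    crawl.delay(bandname)
    return HttpResponse("Queued crawl for %s" % bandname)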

0 Answers