
I'm setting up a Flask app that lets me input a string and passes that string as an argument to my spider, which scrapes a page. I'm having difficulty getting the spider to run when the form is submitted (i.e., integrating Scrapy and Flask).

I've looked at the following solutions to no avail: Run Scrapy from Flask, Running Scrapy spiders in a Celery task, Scrapy and celery `update_state`.

There are clearly several ways to accomplish this, but none of the snippets above worked for me.

routes.py

from flask import render_template, flash, redirect, url_for, session, jsonify, request
from flask_login import login_required, logout_user, current_user, login_user
from werkzeug.urls import url_parse
from app import app, db
from app.forms import LoginForm, RegistrationForm, SearchForm
from app.models import User
#from app.tasks import scrape_async_job
import pprint
import requests
import json

@app.route('/')
@app.route('/index', methods=['GET','POST'])
@login_required
def index():
    jobvisuals = [
        {
            'Job': 'Example',
            'Desc': 'This job requires a degree...',
            'link': 'fakelink',
            'salary': '10$/hr',
            'applied': 'Boolean',
            'interview': 'Boolean'}]
    params = {
        'spider_name': 'indeedspider',  # must match the spider's `name` attribute
        'start_requests': True
    }
    data = requests.get('http://localhost:9080/crawl.json', params=params).json()
    pprint.pprint(data)
    form = SearchForm()
    if request.method == 'GET':
        return render_template('index.html', title='home', jobvisuals=jobvisuals, form=form, search=session.get('search',''))
    job_find = request.form['search']
    session['search'] = job_find
    if form.validate_on_submit():
        print('Working on this feature :D')
        flash('Searching for job {}'.format(form.search.data))

    return render_template('index.html', title='Home', jobvisuals=jobvisuals, form=form)
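Since routes.py already targets scrapyrt's `/crawl.json` endpoint, one way to wire the form input through to the spider is scrapyrt's `crawl_args` parameter, a JSON-encoded dict of spider keyword arguments. A minimal sketch, assuming a scrapyrt instance on port 9080; `build_crawl_params` is a hypothetical helper, and you should verify `crawl_args` support against the scrapyrt version you run:

```python
import json

def build_crawl_params(job):
    # Hypothetical helper: assemble the query string for scrapyrt's
    # /crawl.json endpoint. scrapyrt forwards "crawl_args" (a JSON-encoded
    # dict) to the spider's __init__ as keyword arguments -- check this
    # against your scrapyrt version's docs.
    return {
        'spider_name': 'indeedspider',   # must match the spider's `name`
        'start_requests': 'true',
        'crawl_args': json.dumps({'job': job}),
    }

# Usage, assuming scrapyrt is listening on localhost:9080:
# items = requests.get('http://localhost:9080/crawl.json',
#                      params=build_crawl_params(job_find)).json().get('items', [])
```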

spider.py

import scrapy

class IndeedSpider(scrapy.Spider):
    name = 'indeedspider'
    allowed_domains = ['indeed.com']

    def __init__(self, job='', **kwargs):
        super().__init__(**kwargs)
        # start_urls is a list attribute (there is no start_url method),
        # and the job argument needs an f-string to be interpolated
        self.start_urls = [f'http://www.indeed.com/jobs?q={job}&l=San+Marcos%2C+CA']

    def parse(self, response):
        # Iterate over selectors, not .getall() (which returns plain strings
        # that have no .xpath method), and prefix each field XPath with ".//"
        # so it is scoped to the current card instead of the whole page.
        for item in response.xpath("//div[contains(@class,'jobsearch-SerpJobCard unifiedRow row result clickcard')]"):
            yield {
                'title': item.xpath(".//div[contains(@class,'title')]/text()").get(default='None'),
                'desc': item.xpath(".//div[contains(@class,'summary')]/text()").get(default='None'),
                'link': item.xpath(".//div[contains(@class,'title')]/@href").get(default='None'),
                'location': item.xpath(".//span[contains(@class,'location')]/text()").get(default='None'),
                'salary': item.xpath(".//div[contains(@class,'salarySnippet')]/text()").get(default='None')
            }
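Since the `job` value is interpolated straight into the URL, multi-word searches ("data analyst") or punctuation would produce a malformed query. A small sketch of safe URL construction with the standard library's `urlencode`; the `build_indeed_url` helper is an illustration, not part of the original code:

```python
from urllib.parse import urlencode

def build_indeed_url(job, location='San Marcos, CA'):
    # urlencode escapes spaces, commas, etc., so arbitrary user input
    # survives as a valid query string.
    query = urlencode({'q': job, 'l': location})
    return f'http://www.indeed.com/jobs?{query}'

# build_indeed_url('data analyst')
# -> 'http://www.indeed.com/jobs?q=data+analyst&l=San+Marcos%2C+CA'
```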

Expected:

I type a job into an input box; on submit, the job is passed to the spider; the spider scrapes the first page of indeed.com results only and returns that data on the index page.

Actual: Unsure of where to start.

Can anyone point me in the right direction?

Michael G
  • See https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script – Gallaecio Apr 30 '19 at 10:28
  • @Gallaecio The code provided gives me a 'ValueError: signal only works in main thread.' [This thread](https://stackoverflow.com/questions/36384286/how-to-integrate-flask-scrapy/37270442) gives three potential solutions for me to use. I will need to figure out how to incorporate them as I am running scrapy splash in a docker container on a VM. It would be nice to run the HTTP API server in another docker container on the VM... – Michael G Jun 20 '19 at 05:32

0 Answers