Web crawling using Scrapy: alert when keyword spotted in forum

Question

A few months back, I started looking for an easy way to write a script that would alert me when a keyword is posted in a thread in a forum section.

My research led me to the Python module Scrapy, which I was happy to try because I already knew some Python.

I tried it, but the results were not satisfactory.

Let me explain what I want:

I want to retrieve the threads from the forum's classifieds section, check whether anything new has been posted, and send myself a message if a new thread with a specific word in its title appears.

Here is my code, ntspider.py:

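from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

# LP195xSearchItem is the project's Item subclass with a "title" field
# (assumed to live in the project's items.py; its import is not shown).

class MySpider(BaseSpider):
    name = "LP195xSearch"
    allowed_domains = ["www.mylespaul.com"]
    start_urls = ["http://www.mylespaul.com/forums/member-classifieds/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Thread links carry an id containing "thread_title".
        titles = hxs.select('//a[contains(@id,"thread_title")]/text()').extract()
        t = []

        for title in titles:
            t.append(title)
            item = LP195xSearchItem()
            item["title"] = title
            yield item

        # Print every collected title (Python 2 syntax).
        for i in xrange(len(t)):
            print repr(str(t[i])).center(20)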

This only retrieves and prints the thread titles; now I want it to alert me when one of my keywords is found.

Any help would be very very welcome.

score 2 · Answer 1 · edited May 23 '17 at 10:32


You don't really need Scrapy for this, but for notifications I think you need to set up something like the following:

Set up a cron job to run your spider periodically (daily, hourly, whatever you want).
Set up a database in which to store your thread items.
When you get an item, check that it isn't already in your database and that its title contains your keyword; if both hold, send a notification (the channel is your choice, for example an email).
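
The database and keyword steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a finished notifier: it uses Python 3 and the standard library's `sqlite3`, the names `open_store`, `is_new`, `matches` and `process` are hypothetical, and `notify` defaults to `print` as a placeholder for whatever channel (email or otherwise) you end up choosing.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small SQLite store of thread titles already seen."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (title TEXT PRIMARY KEY)")
    return conn


def is_new(conn, title):
    """Return True the first time a title is seen, and remember it."""
    row = conn.execute("SELECT 1 FROM seen WHERE title = ?", (title,)).fetchone()
    if row is not None:
        return False
    conn.execute("INSERT INTO seen (title) VALUES (?)", (title,))
    conn.commit()
    return True


def matches(title, keywords):
    """Case-insensitive check: does the title contain any keyword?"""
    lowered = title.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def process(conn, titles, keywords, notify=print):
    """Notify once for every new title that contains a keyword."""
    alerted = []
    for title in titles:
        # is_new() runs first so that every title is recorded,
        # including the ones that do not match a keyword.
        if is_new(conn, title) and matches(title, keywords):
            notify("New thread: " + title)
            alerted.append(title)
    return alerted
```

Passing a file path to `open_store` instead of the in-memory default would let the list of seen titles survive between scheduled runs, so the same thread is not reported twice.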

I say you don't really need Scrapy here because you only need to read plain text from a page, which can be done in a simple script with the requests library or one of your choice.


answered Nov 27 '15 at 23:24

eLRuLL



OK thanks, that is the kind of information I was looking for. I had a quick look at the requests module, and it works well for fetching the HTML page, but could you give me some advice on where to start if I want to get, say, the thread titles from the big table? – Michael B Nov 30 '15 at 13:49
You can still use selectors for that; check the `parsel` README: https://github.com/scrapy/parsel – eLRuLL Nov 30 '15 at 14:13
