
I'm new to coding and this is my first project. So far I've pieced together what I have through Googling, tutorials and Stack Overflow.

I'm trying to add data from a pandas df of scraped RSS feeds to a remote SQL database, then host the script on Heroku or AWS and have the script run every hour.

Someone on here recommended that I use APScheduler, as in this post.

I'm struggling, though, as there aren't any "for dummies" tutorials around APScheduler. This is what I've created so far.

I guess my question is: does my script need to be in a function for APScheduler to trigger it, or can it work another way?

from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

@sched.scheduled_job('interval', minutes=1)

sched.configure()
sched.start()

import pandas as pd
from pandas.io import sql
import feedparser
import time


rawrss = ['http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
          'https://www.yahoo.com/news/rss/',
          'http://www.huffingtonpost.co.uk/feeds/index.xml',
          'http://feeds.feedburner.com/TechCrunch/',
          'https://www.uktech.news/feed'
         ]

time = time.strftime('%a %H:%M:%S')
summary = 'text'

posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((time, post.title, post.link, summary))

df = pd.DataFrame(posts, columns=['article_time','article_title','article_url', 'article_summary']) # pass data to init
df.set_index(['article_time'], inplace=True)

import pymysql
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://<username>:<host>:3306/<database_name>?charset=utf8', encoding = 'utf-8')
engine.execute("INSERT INTO rsstracker VALUES('%s', '%s', '%s','%s')" % (time, post.title, post.link, summary))

df.to_sql(con=engine, name='rsstracker', if_exists='append') #, flavor='mysql'
Nick Duddy

1 Answer

Yes. What you want to be executed must be a function (or another callable, like a method). The decorator syntax (@sched.…) needs a function definition (def …) to which the decorator is applied. The code in your example doesn't compile.
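To see the mechanics without APScheduler at all, here is a minimal plain-Python sketch of what decorator syntax requires (the `every_minute` decorator is a made-up stand-in for `sched.scheduled_job(...)`):

```python
# A decorator receives the function defined directly below it and
# returns a (possibly wrapped) replacement for that function.
def every_minute(func):
    # APScheduler would record `func` to call it on a schedule;
    # here we just tag it so the effect is visible.
    func.scheduled = True
    return func

@every_minute          # applied to the `def` on the next line
def process_feeds():
    return 'fetched'

# A bare statement after `@...` instead of a `def` is a SyntaxError,
# which is why the code in the question doesn't compile.
print(process_feeds.scheduled)  # the function was registered
print(process_feeds())          # and is still callable as usual
```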

Then, it's a blocking scheduler, meaning that once you call sched.start() the method doesn't return (unless some scheduled code stops the scheduler), so nothing after that call is executed.

Imports should go to the top; then it's easier to see what the module depends on. And don't import things you don't actually use.

I'm not sure why you import and use pandas for data that doesn't really need DataFrame objects. You also pull in SQLAlchemy without actually using anything this library offers, and you format values as strings into an SQL query, which is dangerous!
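The safe alternative to `%`-formatting is to pass the values separately and let the driver escape them. A minimal sketch using the standard-library `sqlite3` module purely for illustration (the same principle applies to any DB-API driver, `pymysql` included, though `pymysql` uses `%s` placeholders instead of `?`):

```python
import sqlite3

# In-memory database just for the demonstration.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE rsstracker (time TEXT, title TEXT, url TEXT, summary TEXT)'
)

# A title like this would break (or worse, exploit) naive %-formatting.
post_title = "It's a title with a quote"

# Placeholders: the driver escapes each value; never build SQL with % or +.
conn.execute(
    'INSERT INTO rsstracker VALUES (?, ?, ?, ?)',
    ('Mon 12:00:00', post_title, 'http://example.invalid', 'text'),
)
row = conn.execute('SELECT title FROM rsstracker').fetchone()
print(row[0])
```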

Just using SQLAlchemy for the database it may look like this:

#!/usr/bin/env python
# coding: utf-8
from __future__ import absolute_import, division, print_function
from time import strftime

import feedparser
from apscheduler.schedulers.blocking import BlockingScheduler
from sqlalchemy import create_engine, MetaData

sched = BlockingScheduler()


RSS_URLS = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
    'https://www.uktech.news/feed',
]


@sched.scheduled_job('interval', minutes=1)
def process_feeds():
    time = strftime('%a %H:%M:%S')
    summary = 'text'

    engine = create_engine(
        'mysql+pymysql://<username>:<host>:3306/<database_name>?charset=utf8'
    )
    metadata = MetaData(engine, reflect=True)
    rsstracker = metadata.tables['rsstracker']

    for url in RSS_URLS:
        feed = feedparser.parse(url)
        for post in feed.entries:
            (
                rsstracker.insert()
                    .values(
                        time=time,
                        title=post.title,
                        url=post.link,
                        summary=summary,
                    )
                    .execute()
            )


def main():
    sched.configure()
    sched.start()


if __name__ == '__main__':
    main()

The time column seems a bit odd; I would have expected a TIMESTAMP or DATETIME here, not a string that throws away much of the information, leaving just the abbreviated weekday and the time.
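A small sketch of the difference (the fixed `fetched_at` value stands in for `datetime.now()`, which you would use in practice):

```python
from datetime import datetime

# strftime('%a %H:%M:%S') keeps only the weekday and time-of-day;
# a datetime object stored in a DATETIME/TIMESTAMP column keeps it all.
fetched_at = datetime(2017, 9, 13, 19, 7, 0)  # in practice: datetime.now()

lossy = fetched_at.strftime('%a %H:%M:%S')
print(lossy)                   # e.g. 'Wed 19:07:00' - the date is gone
print(fetched_at.isoformat())  # '2017-09-13T19:07:00' - fully recoverable
```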

BlackJack
  • thanks! I need to read through your script to understand how you did this. It's so tidy compared to mine. As for pandas, I started by just reading into it and progressed; I knew it wasn't right, but it was working and I didn't want to break it before I'd managed to reach my first goal, sending it to a database. I'll look at changing time to a TIMESTAMP; that's what I was trying to achieve. Can I use SQLAlchemy without pymysql? – Nick Duddy Sep 13 '17 at 19:07
  • You need to have _a_ module to talk to the MySQL database; it doesn't have to be `pymysql`. The SQLAlchemy documentation [lists all supported modules to connect with MySQL](http://docs.sqlalchemy.org/en/latest/dialects/mysql.html) and their features/restrictions. – BlackJack Sep 14 '17 at 10:42
  • I keep getting a KeyError. I deleted and recreated the table and it's still occurring: `Traceback (most recent call last): File "/usr/local/lib/python3.5/dist-packages/apscheduler/executors/base.py", line 125, in run_job retval = job.func(*job.args, **job.kwargs) File "", line 20, in process_feeds rsstracker = metadata.tables['rsstracker'] KeyError: 'rsstracker'` – Nick Duddy Sep 25 '17 at 20:09
  • Then there is no table named _rsstracker_, or maybe you need to give a more qualified name there. What does `print(metadata.tables.keys())` tell you? – BlackJack Sep 29 '17 at 13:18