I have a Python script that I'd like to run everyday and I'd prefer that it only takes 1-2 hours to run. It's currently setup to hit 4 different APIs for a given URL, capture the results, and then save the data into a PostgreSQL database. The problem is I have over 160,000 URLs to go through and the script ends up taking a really long time -- I ran some preliminary tests and it would take over 36 hours to go through each URL in its current format. So, my question boils down to: should I optimize my script to run multiple threads at the same time? Or should I scale out the number of servers I'm using? Obviously the second approach will be more costly so I'd prefer to have multiple threads running on the same instance.
I'm using a library I created (SocialAnalytics) which provides methods to hit the different API endpoints and parse the results. Here's how I have my script configured:
import psycopg2
from socialanalytics import pinterest
from socialanalytics import facebook
from socialanalytics import twitter
from socialanalytics import google_plus
from time import strftime, sleep
conn = psycopg2.connect("dbname='***' user='***' host='***' password='***'")
cur = conn.cursor()
# Select all URLs
cur.execute("SELECT * FROM urls;")
urls = cur.fetchall()
for url in urls:
# Pinterest
try:
p = pinterest.getPins(url[2])
except:
p = { 'pin_count': 0 }
# Facebook
try:
f = facebook.getObject(url[2])
except:
f = { 'comment_count': 0, 'like_count': 0, 'share_count': 0 }
# Twitter
try:
t = twitter.getShares(url[2])
except:
t = { 'share_count': 0 }
# Google
try:
g = google_plus.getPlusOnes(url[2])
except:
g = { 'plus_count': 0 }
# Save results
try:
now = strftime("%Y-%m-%d %H:%M:%S")
cur.execute("INSERT INTO social_stats (fetched_at, pinterest_pins, facebook_likes, facebook_shares, facebook_comments, twitter_shares, google_plus_ones) VALUES(%s, %s, %s, %s, %s, %s, %s, %s);", (now, p['pin_count'], f['like_count'], f['share_count'], f['comment_count'], t['share_count'], g['plus_count']))
conn.commit()
except:
conn.rollback()
You can see that each call to the API is using the Requests library, which is a synchronous, blocking affair. After some preliminary research I discovered Treq, which is an API on top of Twisted. The asynchronous, non-blocking nature of Twisted seems like a good candidate for improving my approach, but I've never worked with it and I'm not sure how exactly (and if) it'll help me achieve my goal.
Any guidance is much appreciated!