Goal: cycle through 160,000 URLs (from a database), hit 4 different APIs for each one, and store the results in a database.
I tried solving the problem with multiprocessing, but it turns out there are API rate limits (600 requests max per 600 seconds) that prevent me from spawning multiple processes on the same server.
So, it looks like I'm going to have to spin up multiple servers, run the script (modified to pull a URL from the database server), and then save the results back into the database server.
Questions:
- How do I prevent multiple servers from using the same URL? (one idea sketched below this list)
- Do you have any other advice/guidance?
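One option I've been considering for the first question is to let Postgres itself hand out work atomically, so that whatever dispenses URLs (e.g. the API endpoint described further down) just runs a single `UPDATE ... RETURNING`. A minimal sketch, assuming I add a nullable `claimed_at` column to `urls` (that column and the `id`/`url` column names are assumptions for illustration; `FOR UPDATE SKIP LOCKED` needs Postgres 9.5+):

```python
# claim.py -- minimal sketch: each worker atomically claims one unclaimed URL.
# Assumes a nullable claimed_at column has been added to urls, and that id/url
# columns exist by those names; FOR UPDATE SKIP LOCKED requires Postgres 9.5+.
import psycopg2

def claim_next_url(conn):
    """Mark one unclaimed URL as claimed and return (id, url), or None when done."""
    cur = conn.cursor()
    cur.execute("""
        UPDATE urls
           SET claimed_at = now()
         WHERE id = (SELECT id
                       FROM urls
                      WHERE claimed_at IS NULL
                      LIMIT 1
                        FOR UPDATE SKIP LOCKED)
        RETURNING id, url;
    """)
    row = cur.fetchone()   # None once every URL has been claimed
    conn.commit()
    return row
```

Because the claim happens in one statement, two workers can never be handed the same row, and no coordination is needed outside the database itself.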
Details/Requirements:
- Postgres database
- Server with database only has 512MB of RAM (can increase if needed)
- I'd like the total run time of the job to be under 2 hours
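Back-of-the-envelope check on the 2-hour target (assuming each external API enforces its own 600-requests-per-600-seconds limit per server, i.e. roughly 1 URL fully processed per second per server): 160,000 URLs / 7,200 seconds ≈ 22.3 URLs per second overall, so I'd need on the order of 23-25 workers running in parallel, which is where the ~25 servers in the plan below come from.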
Here is the Python script in its current format:
import psycopg2
from socialanalytics import pinterest
from socialanalytics import facebook
from socialanalytics import twitter
from socialanalytics import google_plus
from time import strftime, sleep
conn = psycopg2.connect("dbname='***' user='***' host='***' password='***'")
cur = conn.cursor()
# Select all URLs
cur.execute("SELECT * FROM urls;")
urls = cur.fetchall()
for url in urls:  # url[2] holds the page URL passed to each social API
    # Pinterest
    try:
        p = pinterest.getPins(url[2])
    except:
        p = { 'pin_count': 0 }
    # Facebook
    try:
        f = facebook.getObject(url[2])
    except:
        f = { 'comment_count': 0, 'like_count': 0, 'share_count': 0 }
    # Twitter
    try:
        t = twitter.getShares(url[2])
    except:
        t = { 'share_count': 0 }
    # Google
    try:
        g = google_plus.getPlusOnes(url[2])
    except:
        g = { 'plus_count': 0 }
    # Save results
    try:
        now = strftime("%Y-%m-%d %H:%M:%S")
        cur.execute("INSERT INTO social_stats (fetched_at, pinterest_pins, facebook_likes, facebook_shares, facebook_comments, twitter_shares, google_plus_ones) VALUES (%s, %s, %s, %s, %s, %s, %s);", (now, p['pin_count'], f['like_count'], f['share_count'], f['comment_count'], t['share_count'], g['plus_count']))
        conn.commit()
    except:
        conn.rollback()
Here's how I'm currently thinking about approaching the problem:
- Create an API on the server with the database
  - returns a URL when you perform a GET on /api/v1/next-url
  - accepts a POST request with the results at /api/v1/store-results
- Manually spin up ~25 servers
- Create a script that hits my API to grab a URL, hits the 4 external APIs, then posts the data back to my database through my API. The script terminates when no URL is returned from the first endpoint (see the sketch after this list).
- Manually spin down servers
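For reference, here is a minimal sketch of what I imagine the per-server worker script looking like. The coordinator host, the JSON shapes for both endpoints, and the "204 means no work left" convention are assumptions for illustration; the social API calls mirror the script above.

```python
# worker.py -- minimal sketch of the per-server worker loop.
# The coordinator host, endpoint payloads, and the 204 "no work left"
# convention are assumptions for illustration.
import requests
from socialanalytics import pinterest, facebook, twitter, google_plus

API_BASE = "http://db-server.example.com/api/v1"  # hypothetical coordinator host

def collect_stats(url):
    """Hit the 4 external APIs for one URL, falling back to zeros on failure
    (same behaviour as the try/except blocks in the script above)."""
    stats = {'pinterest_pins': 0, 'facebook_likes': 0, 'facebook_shares': 0,
             'facebook_comments': 0, 'twitter_shares': 0, 'google_plus_ones': 0}
    try:
        stats['pinterest_pins'] = pinterest.getPins(url)['pin_count']
    except Exception:
        pass
    try:
        f = facebook.getObject(url)
        stats['facebook_likes'] = f['like_count']
        stats['facebook_shares'] = f['share_count']
        stats['facebook_comments'] = f['comment_count']
    except Exception:
        pass
    try:
        stats['twitter_shares'] = twitter.getShares(url)['share_count']
    except Exception:
        pass
    try:
        stats['google_plus_ones'] = google_plus.getPlusOnes(url)['plus_count']
    except Exception:
        pass
    return stats

def run_worker():
    while True:
        resp = requests.get(API_BASE + "/next-url")
        if resp.status_code == 204:          # assumed "no URLs left" signal
            break                            # nothing left to do, terminate
        job = resp.json()                    # e.g. {"id": 123, "url": "http://..."}
        payload = collect_stats(job["url"])
        payload["url_id"] = job["id"]
        requests.post(API_BASE + "/store-results", json=payload)

if __name__ == "__main__":
    run_worker()
```

The idea is that each of the ~25 servers runs this same script unmodified, and all coordination lives behind the two API endpoints on the database server.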
Any help is much appreciated!