Goal: cycle through 160,000 URLs (from a database), hit 4 different APIs for each one, and store the results back in the database.

I tried solving the problem with multiprocessing, but it turns out the APIs have rate limits (max. 600 requests in 600 seconds) that prevent me from spawning multiple processes on the same server.

So, it looks like I'm going to have to spin up multiple servers, run the script (modified to pull a URL from the database server), and then save the results back into the database server.

Questions:

  • How do I prevent multiple servers from using the same URL?
  • Do you have any other advice/guidance?

Details/Requirements:

  • Postgres database
  • Server with database only has 512MB of RAM (can increase if needed)
  • I'd like the total run time of the job to be under 2 hours (quick math on what this implies is below)
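For context on those numbers: assuming the rate limit really is enforced per IP, 600 requests per 600 seconds is effectively 1 request per second per API per server, and 160,000 URLs in 2 hours works out to roughly 22 URLs per second overall, so the job needs at least 23 workers with distinct IPs. The ~25 servers in the approach below just adds a little headroom.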

Here is the Python script in its current format:

import psycopg2
from socialanalytics import pinterest
from socialanalytics import facebook
from socialanalytics import twitter
from socialanalytics import google_plus
from time import strftime, sleep

conn = psycopg2.connect("dbname='***' user='***' host='***' password='***'")
cur = conn.cursor()

# Select all URLs
cur.execute("SELECT * FROM urls;")
urls = cur.fetchall()

for url in urls:

    # Pinterest
    try:
        p = pinterest.getPins(url[2])
    except:
        p = { 'pin_count': 0 }
    # Facebook
    try:
        f = facebook.getObject(url[2])
    except:
        f = { 'comment_count': 0, 'like_count': 0, 'share_count': 0 }
    # Twitter
    try:
        t = twitter.getShares(url[2])
    except:
        t = { 'share_count': 0 }
    # Google
    try:
        g = google_plus.getPlusOnes(url[2])
    except:
        g = { 'plus_count': 0 }

    # Save results
    try:
        now = strftime("%Y-%m-%d %H:%M:%S")
        cur.execute("INSERT INTO social_stats (fetched_at, pinterest_pins, facebook_likes, facebook_shares, facebook_comments, twitter_shares, google_plus_ones) VALUES(%s, %s, %s, %s, %s, %s, %s, %s);", (now, p['pin_count'], f['like_count'], f['share_count'], f['comment_count'], t['share_count'], g['plus_count']))
        conn.commit()
    except:
        conn.rollback()

Here's how I'm currently thinking about approaching the problem:

  1. Create an API on the server with the database
    • returns a URL when you perform a GET on /api/v1/next-url
    • accepts POST request to /api/v1/store-results
  2. Manually spin up ~25 servers
  3. Create a script that hits my API to grab a URL, hits the 4 external APIs, then posts the data back to my database through my API. The script terminates when no URL is returned from the first API endpoint (a rough sketch of both pieces follows after this list).
  4. Manually spin down servers
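
Here is a rough, hedged sketch of what steps 1 and 3 could look like. It is not meant as final code: it assumes Flask for the small API on the database server and the requests library on the workers, it invents a claimed_at column on the urls table (NULL = not yet handed out) plus id/url column names, and it leans on Flask's single-threaded development server to serialize the /next-url hand-outs; a production setup would want a more robust row-claiming scheme.

# --- coordinator.py (runs on the database server) ---
import psycopg2
from flask import Flask, jsonify, request

app = Flask(__name__)
conn = psycopg2.connect("dbname='***' user='***' host='***' password='***'")

@app.route('/api/v1/next-url')
def next_url():
    cur = conn.cursor()
    # Claim one unclaimed URL and hand it out exactly once.
    cur.execute("""
        UPDATE urls SET claimed_at = now()
        WHERE id = (SELECT id FROM urls WHERE claimed_at IS NULL LIMIT 1)
        RETURNING id, url;
    """)
    row = cur.fetchone()
    conn.commit()
    if row is None:
        return jsonify({}), 404          # nothing left; workers exit on this
    return jsonify({'id': row[0], 'url': row[1]})

@app.route('/api/v1/store-results', methods=['POST'])
def store_results():
    data = request.get_json()
    cur = conn.cursor()
    cur.execute("INSERT INTO social_stats (fetched_at, pinterest_pins, facebook_likes, facebook_shares, facebook_comments, twitter_shares, google_plus_ones) VALUES (now(), %s, %s, %s, %s, %s, %s);",
                (data['pin_count'], data['like_count'], data['share_count'], data['comment_count'], data['twitter_shares'], data['plus_count']))
    conn.commit()
    return '', 204

app.run(host='0.0.0.0', port=5000)

# --- worker.py (one copy per droplet) ---
import requests
from socialanalytics import pinterest, facebook, twitter, google_plus

API = 'http://<database-server-ip>:5000'

while True:
    resp = requests.get(API + '/api/v1/next-url')
    if resp.status_code != 200:
        break                            # no URL returned -> terminate (step 3)
    url = resp.json()['url']

    results = {'pin_count': 0, 'like_count': 0, 'share_count': 0,
               'comment_count': 0, 'twitter_shares': 0, 'plus_count': 0}
    try:
        results['pin_count'] = pinterest.getPins(url)['pin_count']
    except Exception:
        pass
    # ... Facebook, Twitter and Google+ lookups exactly as in the script above ...

    requests.post(API + '/api/v1/store-results', json=results)

Handing URLs out from a single place like this also answers the first question by construction: no two workers can receive the same URL, because each row is claimed in the database before it is returned.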

Any help is much appreciated!

  • How is the rate limit enforced? If it's via an API key you're absolutely barking up the wrong tree. I only ask because rate limits are very rarely enforced via IP, which would have to be the case for what you're implying. – Slater Victoroff Feb 11 '15 at 06:28
  • I'm not using an API key, so they must be enforced via IP address http://stackoverflow.com/a/8713296/899904 – Abundnce10 Feb 11 '15 at 06:32
  • Where will you spin up the VMs? They will need to have different public IPs. Other than that your idea sounds good, except the manual part; try using something like [ansible](http://www.ansible.com/home) to spin up the machines, distribute the python script and start running it – peter Feb 11 '15 at 08:24
  • Shouldn't you contact the API provider and ask them? The limits are there for a reason and trying to circumvent them doesn't sound like a good idea. – Jakub Kania Feb 11 '15 at 09:21
  • @peter I use DigitalOcean as my VPS provider. I just assumed each droplet would get its own unique public IP, is that not the case? – Abundnce10 Feb 11 '15 at 15:42
  • @Abundnce10 yeah should be. And in that case ansible has features to automatically provision those droplets – peter Feb 11 '15 at 15:48
  • @peter I've researched ansible and it seems like an easy way to spin up and configure servers. Can I also use it to start a python script? Or will I need to use cron? – Abundnce10 Feb 11 '15 at 15:55
  • @Abundnce10 should be able to run arbitrary commands as well (run python scripts,...) – peter Feb 11 '15 at 15:57

1 Answer


How do I prevent multiple servers from using the same URL?

One easy way is to take a hash of the URL modulo the server count. That gives you the index of the server that should handle that URL.

E.g.:

server_idx = zlib.crc32(url) % server_count
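
A minimal sketch of how a worker could apply this, assuming each server is told its own index (via a command-line argument, an environment variable, etc.); SERVER_COUNT and SERVER_IDX are illustrative names, the urls query mirrors the question's script, and the encode() is there because Python 3's zlib.crc32 expects bytes.

import zlib
import psycopg2

SERVER_COUNT = 25    # total number of worker servers
SERVER_IDX = 3       # this worker's index, 0 .. SERVER_COUNT - 1

conn = psycopg2.connect("dbname='***' user='***' host='***' password='***'")
cur = conn.cursor()
cur.execute("SELECT * FROM urls;")

for row in cur.fetchall():
    url = row[2]
    # Skip URLs that hash to another server's index; every worker runs the
    # same script, so no coordination between servers is needed.
    if zlib.crc32(url.encode('utf-8')) % SERVER_COUNT != SERVER_IDX:
        continue
    # ... fetch and store the social stats for this URL as in the question ...

The trade-off versus a central /next-url endpoint is that no extra API is needed, but a worker that dies silently leaves its whole slice of URLs unprocessed.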