
I have a Python script that I'd like to run every day, and I'd prefer that it only takes 1-2 hours to run. It's currently set up to hit 4 different APIs for a given URL, capture the results, and then save the data into a PostgreSQL database. The problem is I have over 160,000 URLs to go through, and in its current form the script takes a really long time -- some preliminary tests suggest it would take over 36 hours to get through every URL. So my question boils down to: should I optimize my script to run multiple threads at the same time, or should I scale out the number of servers I'm using? Obviously the second approach will be more costly, so I'd prefer to have multiple threads running on the same instance.
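
For a rough sense of scale (assuming the 36-hour estimate holds and scaling is close to linear), here's the back-of-envelope math I'm working from:

urls_total = 160000
seconds_per_url = 36 * 3600 / float(urls_total)  # ~0.81 s per URL today
speedup_for_2h = 36.0 / 2                        # ~18x, i.e. ~18 concurrent workers
speedup_for_1h = 36.0 / 1                        # ~36x

So it looks like I'd need somewhere around 20-40 requests in flight at once to hit my target.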

I'm using a library I created (SocialAnalytics) which provides methods to hit the different API endpoints and parse the results. Here's how I have my script configured:

import psycopg2
from socialanalytics import pinterest
from socialanalytics import facebook
from socialanalytics import twitter
from socialanalytics import google_plus
from time import strftime, sleep

conn = psycopg2.connect("dbname='***' user='***' host='***' password='***'")
cur = conn.cursor()

# Select all URLs
cur.execute("SELECT * FROM urls;")
urls = cur.fetchall()

for url in urls:

    # Pinterest
    try:
        p = pinterest.getPins(url[2])
    except:
        p = { 'pin_count': 0 }
    # Facebook
    try:
        f = facebook.getObject(url[2])
    except:
        f = { 'comment_count': 0, 'like_count': 0, 'share_count': 0 }
    # Twitter
    try:
        t = twitter.getShares(url[2])
    except:
        t = { 'share_count': 0 }
    # Google
    try:
        g = google_plus.getPlusOnes(url[2])
    except:
        g = { 'plus_count': 0 }

    # Save results
    try:
        now = strftime("%Y-%m-%d %H:%M:%S")
        cur.execute("INSERT INTO social_stats (fetched_at, pinterest_pins, facebook_likes, facebook_shares, facebook_comments, twitter_shares, google_plus_ones) VALUES(%s, %s, %s, %s, %s, %s, %s, %s);", (now, p['pin_count'], f['like_count'], f['share_count'], f['comment_count'], t['share_count'], g['plus_count']))
        conn.commit()
    except:
        conn.rollback()

You can see that each API call uses the Requests library, which is a synchronous, blocking affair. After some preliminary research I discovered Treq, which is an API on top of Twisted. The asynchronous, non-blocking nature of Twisted seems like a good candidate for improving my approach, but I've never worked with it and I'm not sure exactly how (or if) it will help me achieve my goal.
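
From skimming the treq examples, I think the shape would be roughly something like this, though I haven't actually tested any of it (fetch_counts, the placeholder URL list, and the DeferredSemaphore cap are just my guesses at the structure):

import treq
from twisted.internet import defer, task

CONCURRENCY = 20  # would need tuning against the APIs' rate limits

@defer.inlineCallbacks
def fetch_counts(url):
    # Issue one non-blocking request and parse the JSON body
    response = yield treq.get(url, timeout=10)
    data = yield treq.json_content(response)
    defer.returnValue(data)

def main(reactor):
    urls = ['http://example.com/a', 'http://example.com/b']  # placeholder list
    sem = defer.DeferredSemaphore(CONCURRENCY)  # cap requests in flight
    work = [sem.run(fetch_counts, u) for u in urls]
    return defer.gatherResults(work, consumeErrors=True)

task.react(main)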

Any guidance is much appreciated!

  • Do the various url requests require a shared memory space? If not, a simple solution would be to use multiprocessing. For some quick code, see: http://stackoverflow.com/questions/3842237/parallel-processing-in-python – duhaime Nov 09 '14 at 02:44
  • No, they don't need a shared memory space. The only way I can think of to shave off the time needed would be to run 10 processes, each dealing with its own URL. However, I'm not exactly sure how to accomplish this. Am I heading in the right direction with this approach? https://gist.github.com/anonymous/b337afbd8f92d3991b47 – Abundnce10 Nov 09 '14 at 17:55
  • Sorry, just got this. It seems all worked out :) – duhaime Nov 09 '14 at 21:35
  • I got the multiprocessing working, but now I'm running into limitations in the number of API requests I can make to Facebook :( – Abundnce10 Nov 09 '14 at 22:33
  • Can you create multiple credential sets? – duhaime Nov 10 '14 at 00:11
  • I'm actually not using any credentials when accessing the Facebook API https://github.com/Abundnce10/socialanalytics/blob/master/socialanalytics/facebook.py#L4 Facebook offers the ability to ask for an exception to the API threshold but I'm not sure if my case would apply https://www.facebook.com/help/contact/?id=206103619499608 – Abundnce10 Nov 10 '14 at 00:29
  • Hmm, how many API calls are you making each day? You might want to open a new question on SO regarding ways to resolve the problem. If you're firing fewer than 100M requests a day then you should be well within Facebook's restrictions, no? – duhaime Nov 10 '14 at 13:08

1 Answer


First, you should measure the time your script spends on each step. Maybe you will discover something interesting :)
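
For example, a minimal sketch for timing each step (the timed helper is just a suggestion; wrap whichever calls you want to measure):

import time

def timed(label, func, *args):
    # Run one step and report how long it took
    start = time.time()
    result = func(*args)
    print('%s took %.2f s' % (label, time.time() - start))
    return result

# e.g. inside the loop from your question:
# p = timed('pinterest', pinterest.getPins, url[2])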

Second, you can split your URLs into chunks:

chunk_size = len(urls) // cpu_core_count  # don't forget about the remainder of the division
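
For example, one simple way to build the chunks with list slicing (the remainder just ends up in a shorter final chunk):

url_chunks = [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]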

After these steps you can use multiprocessing to process every chunk in parallel. Here is an example for you:

import multiprocessing as mp

p = mp.Pool(5)  # number of worker processes; tune it based on your tests

# First solution: process the URLs chunk by chunk
for urls_chunk in url_chunks:  # url_chunks = [(url1...url6), (url7...url12), ...]
    res = p.map(get_social_stat, urls_chunk)
    for record in res:
        save_to_db(record)

# Or, more simply, hand the whole flat list to the pool at once
res = p.map(get_social_stat, urls)

for record in res:
    save_to_db(record)
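
The get_social_stat and save_to_db functions above are just placeholders for logic you already have. A rough sketch, reusing the code from your question (each worker process should open its own database connection, so for simplicity this sketch reconnects on every save):

from time import strftime
import psycopg2
from socialanalytics import pinterest, facebook, twitter, google_plus

def get_social_stat(url_row):
    # url_row is one row from the urls table; url_row[2] is the URL itself
    try:
        p = pinterest.getPins(url_row[2])
    except Exception:
        p = {'pin_count': 0}
    try:
        f = facebook.getObject(url_row[2])
    except Exception:
        f = {'comment_count': 0, 'like_count': 0, 'share_count': 0}
    try:
        t = twitter.getShares(url_row[2])
    except Exception:
        t = {'share_count': 0}
    try:
        g = google_plus.getPlusOnes(url_row[2])
    except Exception:
        g = {'plus_count': 0}
    return (strftime("%Y-%m-%d %H:%M:%S"), p['pin_count'], f['like_count'],
            f['share_count'], f['comment_count'], t['share_count'], g['plus_count'])

def save_to_db(record):
    conn = psycopg2.connect("dbname='***' user='***' host='***' password='***'")
    cur = conn.cursor()
    try:
        cur.execute("INSERT INTO social_stats (fetched_at, pinterest_pins, facebook_likes, facebook_shares, facebook_comments, twitter_shares, google_plus_ones) VALUES(%s, %s, %s, %s, %s, %s, %s);", record)
        conn.commit()
    except Exception:
        conn.rollback()
    finally:
        conn.close()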

Also, gevent can help you, since it can cut down the time spent on a sequence of synchronous, blocking requests by running them concurrently.
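
For example, a minimal gevent version of the simple variant above might look like this (a sketch; monkey-patching makes the blocking Requests calls cooperative, and the pool size caps concurrency):

from gevent import monkey
monkey.patch_all()  # must run before requests is imported so its sockets become cooperative

from gevent.pool import Pool

pool = Pool(20)  # cap concurrent greenlets, e.g. to stay under the APIs' rate limits

# urls, get_social_stat and save_to_db as in the question and the sketch above
for record in pool.imap_unordered(get_social_stat, urls):
    save_to_db(record)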

  • I only have 1 CPU right now (running on the smallest DigitalOcean droplet), however, I can increase this if needed. Will I not be able to run 10 or more processes on my current VPS? If not, I don't think I'll be able to shave off the time I need -- it currently takes 36 hours, and I would like to get it down to 1-4 hours. I've extracted the logic from the `for` loop into its own functions but I'm still unsure how to set up the multiprocessing part. Can you take a look? https://gist.github.com/anonymous/b337afbd8f92d3991b47 – Abundnce10 Nov 09 '14 at 18:03
  • Of course, you can spawn 10 processes. But it's better to find the optimal number with a performance test: try parsing 100 URLs with different numbers of processes, and aim to load your CPU to around 70-80 percent. – Jimilian Nov 09 '14 at 19:25
  • I used your code above and found 10 processes only brought my CPU usage up to around 35%. However, after a few tests I found I hit the limit for Facebook's API. I'm now getting `(#4) Application request limit reached` http://stackoverflow.com/questions/14092989/facebook-api-4-application-request-limit-reached. Thanks for your help! – Abundnce10 Nov 09 '14 at 20:41
  • I actually noticed a problem with the code above. If I start with 100 URLs I only get 50 items in the `res` list. Do you know why that would happen? Here's the code, along with the output it produces: https://gist.github.com/anonymous/1d7d1d21029b42b74d20 – Abundnce10 Nov 09 '14 at 20:53
  • It looks like only the latter half of the `url_chunks` are making it into the `res` list. Any idea why the first half aren't being included? – Abundnce10 Nov 09 '14 at 21:02
  • Disregard my last 2 comments. I was doing `for url in urls:` and then calling `urls.pop()` which is why I was only getting half of the items from `urls`. Thanks again! – Abundnce10 Nov 09 '14 at 21:11