
My objective is to make one API call for each row in a Pandas DataFrame. Each response JSON contains a list of strings, and I want to create a new DataFrame with one row per element of that list. My code basically looks like this:

import json
import pandas
import requests

i = 0
new_df = pandas.DataFrame(columns=['a', 'b', 'c', 'd'])
for index, row in df.iterrows():
    url = 'http://myAPI/'
    # Build the payload; the exact shape depends on what row['data'] contains
    # (the original string had a misplaced quote and would not parse as JSON)
    d = '{"SomeJSONData": ' + row['data'] + '}'
    j = json.loads(d)
    response = requests.post(url, json=j)

    data = response.json()
    # One new row per element of the 'c' list in the response
    for new_data in data['c']:
        new_df.loc[i] = [row['a'], row['b'], row['c'], new_data]
        i += 1

This works fine, but I'm making about 5,500 API calls and writing about 6,500 rows to the new DataFrame, so it takes a while, maybe 10 minutes. Does anyone know of a way to speed this up? I'm not too familiar with running parallel for loops in Python; could this be done while maintaining thread safety?
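For reference, a minimal sketch of how the calls could be parallelized with a thread pool (the payload format and the `'c'` response key are taken from the code above; `max_workers=20` is an arbitrary choice). Thread safety is easy to maintain here because each worker only reads its own row's data and returns a value; the new DataFrame is built afterwards, in the main thread:

from concurrent.futures import ThreadPoolExecutor
import json
import pandas
import requests

url = 'http://myAPI/'

def call_api(data):
    # Each thread builds its own payload and returns the parsed response
    payload = json.loads('{"SomeJSONData": ' + data + '}')
    return requests.post(url, json=payload).json()

# The work is I/O-bound, so threads overlap the waiting on the API
with ThreadPoolExecutor(max_workers=20) as pool:
    responses = list(pool.map(call_api, df['data']))

# pool.map preserves input order, so responses line up with the rows
rows = []
for (_, row), resp in zip(df.iterrows(), responses):
    for new_data in resp['c']:
        rows.append([row['a'], row['b'], row['c'], new_data])
new_df = pandas.DataFrame(rows, columns=['a', 'b', 'c', 'd'])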

Martin Boros
  • Since your data-frame has several `dtype=object` columns, just using `iterrows` is going to be about as fast as you can get. You can try threading, since requests are I/O bound. In which case, look at the examples [here](https://stackoverflow.com/questions/2632520/what-is-the-fastest-way-to-send-100-000-http-requests-in-python). Do note that question was originally posed 7 years ago, so look at the more recent examples. – juanpa.arrivillaga Oct 17 '17 at 21:12
  • I would use [`requests-futures`](https://github.com/ross/requests-futures) outside of pandas to make async requests, get the results, and rebuild the column after that. – roganjosh Oct 17 '17 at 21:13
  • It's not the looping that is an issue; it's the API calls. I have no idea if the web API can handle a bulk query. I'm no expert at what @roganjosh suggested... but that sounds like a good idea. – piRSquared Oct 17 '17 at 21:13
  • Also, you should consider not using `requests` because it is synchronous by design. Check out some options [here](http://mahugh.com/2017/05/23/http-requests-asyncio-aiohttp-vs-requests/) – juanpa.arrivillaga Oct 17 '17 at 21:16
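A minimal sketch of the asynchronous approach these comments point to, using `aiohttp` (an assumption; it is one of the options the linked article compares with `requests`). The payload format and the use of `df['data']` are carried over from the question:

import asyncio
import json
import aiohttp

url = 'http://myAPI/'

async def fetch(session, payload):
    # POST one payload and return the parsed JSON response
    async with session.post(url, json=payload) as resp:
        return await resp.json()

async def fetch_all(payloads):
    # One shared session; all requests are in flight concurrently
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, p) for p in payloads))

payloads = [json.loads('{"SomeJSONData": ' + d + '}') for d in df['data']]
responses = asyncio.get_event_loop().run_until_complete(fetch_all(payloads))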

1 Answer


Something along these lines, perhaps? This way you aren't creating a whole new DataFrame, you're only declaring the URL once, and you're taking advantage of the fact that pandas column operations are faster than row-by-row operations.

url = 'http://myAPI/'

def request_function(j):
    # POST the prepared JSON string and pull the 'c' list out of the
    # response (the original indexed the Response object directly and
    # was missing the .json() call)
    return requests.post(url, json=json.loads(j)).json()['c']

df['j'] = '{"SomeJSONData": ' + df['data'] + '}'
df['new_data'] = df['j'].apply(request_function)
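Note that `request_function` returns a list, so `df['new_data']` now holds one list per row. A sketch of one way to flatten that into one row per element, assuming pandas 0.25 or later (which added `DataFrame.explode`):

# Each list element becomes its own row; a, b, c are repeated alongside it
new_df = df[['a', 'b', 'c', 'new_data']].explode('new_data').reset_index(drop=True)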

Now, to prove that using `apply` in this case (string data) is indeed much faster, here's a simple test:

import numpy as np
import pandas as pd
import time

def func(text):
    return text + ' is processed'


def test_one():
    # Column-wise: transform the whole column with .apply
    data = pd.DataFrame(columns=['text'], index=np.arange(0, 100000))
    data['text'] = 'text'

    start = time.time()
    data['text'] = data['text'].apply(func)
    print(time.time() - start)


def test_two():
    # Row-by-row: iterate with iterrows and assign each cell with .loc
    data = pd.DataFrame(columns=['text'], index=np.arange(0, 100000))
    data['text'] = 'text'

    start = time.time()

    for index, row in data.iterrows():
        data.loc[index, 'text'] = row['text'] + ' is processed'

    print(time.time() - start)

Results of the string operations on DataFrames:

test_one (using apply): 0.023002147674560547

test_two (using iterrows): 18.912891149520874

Basically, by using the built-in pandas operations of adding the two columns and `apply`, you should get somewhat faster results, though your total runtime is still limited by the API response time. If the results are still too slow, you might want to consider writing an async function that saves the results to a list, then `.apply` that async function.
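A minimal sketch of that idea using the `requests-futures` package mentioned in the comments (the `max_workers` value is arbitrary): fire off every request first without blocking, then collect the results afterwards.

from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=20)

# Each .post returns immediately with a Future instead of blocking
df['future'] = df['j'].apply(lambda j: session.post(url, json=json.loads(j)))

# .result() blocks only until that particular response is ready
df['new_data'] = df['future'].apply(lambda f: f.result().json()['c'])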

2yan
  • `.apply` is *not* faster than a for-loop, generally. Indeed, it *is a Python for loop, under the hood*. – juanpa.arrivillaga Oct 17 '17 at 21:29
  • @juanpa.arrivillaga Please don't spread misinformation. Under the hood, pandas performs many optimizations. The apply function ultimately takes advantage of a number of internal optimizations, such as using iterators in Cython. If you disagree with that, try testing it out yourself. Statistics speak louder than words. – 2yan Oct 17 '17 at 21:46
  • True, OP could send the `.apply` function an async function that appends the results to a file on disk, or adds them to a list, and bam, lightning fast. – 2yan Oct 17 '17 at 21:49
  • @roganjosh Check my answer; I've put a quick study comparing apply vs iterrows, regarding string operations, in my response. – 2yan Oct 17 '17 at 21:52
  • @roganjosh You don't need to believe me; just time it yourself. – 2yan Oct 17 '17 at 22:01
  • You are correct, I was thinking of `pandas.Dataframe.apply` with `axis=1`, which *does*, as far as I can tell, revert to a python for-loop (coming in at about the same time as iterrows). I am actually *very surprised* at just how fast `pd.Series.apply` is with string operations. Digging deeper, it seems to actually be the `loc` based assignment with strings that is tripping everything up. In other words, applying the function, and iterating over the data-frame with `itertuples` gives something around `0.10` seconds. Still slower, but not 2 orders of magnitude – juanpa.arrivillaga Oct 17 '17 at 22:39
  • IOW, `itertuples` *without* assignment gives `0.08292317390441895`, whereas `apply` gives `0.05028414726257324`. – juanpa.arrivillaga Oct 17 '17 at 22:42
  • @juanpa.arrivillaga I tend to work with a lot of big data, and in general I always use `.apply`; I've found the performance gap tends to grow massively as data sizes increase. In my example, using 100k rows, it was 0.023 vs 18.91, which is a massive difference, given that ultimately the same calculations are being done. – 2yan Oct 17 '17 at 22:49
  • Yes, but as you see, it's not actually `.apply` vs `itertuples`; it's the `loc`-based *assignment* that slows things down. So, if you modify test 2 to append to a list, then use `data['text'] = accumulator_list`, the performance difference is about `0.9` vs `0.5` on my machine (vs about 20 sec for `loc` assignment). This enormous penalty of doing `df.loc[x, y] = z` was a total surprise for me; then again, I essentially *never* do that (a sketch of this list-accumulator variant follows these comments). – juanpa.arrivillaga Oct 17 '17 at 22:51
  • That's a good point: with those changes we get a difference of `Test 1: 0.022` vs `Test 2: 5.244`. That's still roughly a 238x slowdown, but it's a lot less than I originally thought. – 2yan Oct 17 '17 at 23:06
  • This looks good, thank you for giving a test that shows how much faster apply is than a for loop. However, the main issue is that the new DataFrame has more rows than the original DataFrame (6500 vs 5500). This is because the JSON returned by the API contains a list, and I need to create a new row for each element past the first one, so I don't think apply alone can work in this situation. – Martin Boros Oct 18 '17 at 12:51
  • Now, generally for speed, you want to use all of pandas' built-in functions: so instead of iterating through the JSON, you can try `pd.read_json(json_data)` and then have a global variable with your final list that you just `.append` to in the function you call. Although I don't want to just spout stuff off without empirically measuring it. – 2yan Oct 18 '17 at 15:11
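For completeness, a sketch of the list-accumulator variant of `test_two` described in the comments above: iterate with `itertuples`, collect results in a plain list, and assign the column once at the end instead of cell-by-cell with `.loc`.

def test_two_list():
    data = pd.DataFrame(columns=['text'], index=np.arange(0, 100000))
    data['text'] = 'text'

    start = time.time()

    # Accumulate in a list; a single column assignment replaces
    # the per-cell .loc writes that dominated test_two's runtime
    accumulator = []
    for row in data.itertuples():
        accumulator.append(row.text + ' is processed')
    data['text'] = accumulator

    print(time.time() - start)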