
I've got a DataFrame where I need to process the value in each row by passing it to an external function that I don't control, and I'd like to do it as fast as possible (I'm limited to 20 req/s by an external API).

import pandas as pd

d1 = {1: ['Test', 'Test1', 'Test2'], 2: ['file1', 'file2', 'file3'], 3: [pd.NA, pd.NA, pd.NA]}
df = pd.DataFrame(data=d1)

       1      2     3
0   Test  file1  <NA>
1  Test1  file2  <NA>
2  Test2  file3  <NA>

What would be the best way for me to send values from column 2 to the function process_file(), saving the returned value in column 3, and doing this in parallel to go as fast as possible (keeping the limit in mind)? My first port of call would usually be asyncio, but since process_file() is not asyncio-enabled, I'm stuck.

Anyone?

splotsh

1 Answer


You can send values from a column to a function using the pandas.Series.apply method:

df.loc[:, 3] = df.loc[:, 2].apply(process_file)

This thread explains how you can rate-limit your requests:

How to limit the rate/speed of python pandas apply when calling an API?
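
A minimal sketch of that idea, assuming process_file() takes a single value and returns the result to store; rate_limited here is a hypothetical wrapper, not a pandas feature:

import time

def rate_limited(func, max_per_second=20):
    # Hypothetical wrapper: enforce a minimum interval between calls
    min_interval = 1.0 / max_per_second
    last_call = [0.0]  # mutable cell so the closure can update it
    def wrapper(value):
        wait = min_interval - (time.monotonic() - last_call[0])
        if wait > 0:
            time.sleep(wait)
        last_call[0] = time.monotonic()
        return func(value)
    return wrapper

df.loc[:, 3] = df.loc[:, 2].apply(rate_limited(process_file))

Note that apply is still sequential, so this only caps the rate; it doesn't overlap requests.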

hoomant
  • Thanks, but from what I can tell .apply won't run in parallel; I'll still be sending one request at a time, waiting for the function to return before applying it to the next value. – splotsh Jul 10 '22 at 09:41
  • For anyone else that finds this - I ended up using pandarallel – splotsh Jul 10 '22 at 15:03
  • That's right. You can use the `swifter` package, as explained in [Make Pandas DataFrame apply() use all cores?](https://stackoverflow.com/a/51669468/14005384), or `pandarallel`. – hoomant Jul 10 '22 at 15:27
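
For completeness, a thread-pool sketch that both parallelizes the calls and keeps the 20 req/s cap. It assumes process_file() is I/O-bound and safe to call from multiple threads; make_throttle is a hypothetical helper:

import threading
import time
from concurrent.futures import ThreadPoolExecutor

def make_throttle(max_per_second=20):
    # Hypothetical thread-safe throttle: hands out one start slot per interval
    lock = threading.Lock()
    interval = 1.0 / max_per_second
    state = {"next_slot": 0.0}
    def wait():
        with lock:
            now = time.monotonic()
            slot = max(now, state["next_slot"])
            state["next_slot"] = slot + interval
        time.sleep(max(0.0, slot - now))
    return wait

throttle = make_throttle(20)

def throttled_process(value):
    throttle()  # block until a request slot is free
    return process_file(value)

with ThreadPoolExecutor(max_workers=20) as pool:
    df.loc[:, 3] = list(pool.map(throttled_process, df.loc[:, 2]))

pandarallel's parallel_apply would parallelize similarly, but since it runs worker processes, a shared in-process throttle like the one above wouldn't carry over without extra coordination.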