I have df_fruits, which is a DataFrame of fruits:
index  name
1      apple
2      banana
3      strawberry
and its market prices are stored in a MySQL database like the one below:
category    market  price
apple       A       1.0
apple       B       1.5
banana      A       1.2
banana      A       3.0
apple       C       1.8
strawberry  B       2.7
...
While iterating over df_fruits, I'd like to do some processing on each row.
The code below is a non-parallel version.
def process(fruit):
    # make a DB connection
    # fetch the prices of the fruit from the database
    # do some processing with the fetched data, which takes a long time
    # insert the result into the DB
    # close the DB connection
    ...

for idx, f in df_fruits.iterrows():
    process(f)
What I want is to run process on each row of df_fruits in parallel, since df_fruits has plenty of rows and the market-price table is quite large (fetching the data takes a long time). As you can see, the order of execution between rows does not matter and no data is shared between them.
When iterating over df_fruits, I'm confused about where to put `pool.map()`. Do I need to split the rows into chunks myself before the parallel execution and distribute the chunks to each process? (If so, wouldn't a process that finishes its job earlier than the others sit idle?)
I've looked into pandarallel, but I can't use it (my OS is Windows).
Any help would be appreciated.