
I'm trying to speed up some multiprocessing code in Python 3. I have a big read-only DataFrame and a function that makes some calculations based on the values it reads.

I tried to solve the issue by writing the function in the same file and sharing the big DataFrame as a global, as you can see here. This approach does not allow moving the process function to another file/module, and accessing a variable outside the scope of the function is a bit awkward.

import pandas as pd
import multiprocessing


def process(user):
    # Locate all the user sessions in the *global* sessions dataframe
    user_session = sessions.loc[sessions['user_id'] == user]
    user_session_data = pd.Series()

    # Make calculations and append to user_session_data

    return user_session_data


# The DataFrame users contains ID, and other info for each user
users = pd.read_csv('users.csv')

# Each row is the details of one user action.
# There are several rows with the same user ID
sessions = pd.read_csv('sessions.csv')

p = multiprocessing.Pool(4)
sessions_id = sessions['user_id'].unique()

# I'm passing an integer ID argument to process() function so 
# there is no copy of the big sessions DataFrame
result = p.map(process, sessions_id)

Things I've tried:

  • Pass a DataFrame instead of an integer ID argument to avoid the sessions.loc... line of code. This approach slows the script down a lot.

Also, I've looked at How to share pandas DataFrame object between processes? but didn't find a better way.

David Gasquez

1 Answer


You can try defining process as:

def process(sessions, user):
   ...

And put it wherever you prefer.

Then when you call p.map you can use functools.partial, which lets you fix some of the arguments in advance:

 from functools import partial
 ...

 p.map(partial(process, sessions), sessions_id)

This should not slow down the processing too much and should address your issue.

Note that you cannot do the same with a lambda:

 p.map(lambda id: process(sessions, id), sessions_id)

multiprocessing.Pool pickles the callable to send it to the worker processes, and lambdas can't be pickled; a partial of a module-level function can.
Teudimundo
  • It works without slowing the processing too much. Is it common to have functions with a signature like `func(big_df, id)` ? – David Gasquez Feb 03 '16 at 14:06
  • I don't see any issue with that. If you really don't like it, you can think of having an object that takes the df in the constructor and with a `process` method that takes only the id. Then you call the method in the `p.map`. I would call that over-engineering, but if your actual scenario is more complex it could make sense. (please accept the answer if you think it's worth it) – Teudimundo Feb 03 '16 at 14:19