I'm seeing odd behavior when working with timestamps in pandas DataFrames. The project I'm working on requires that I read in data, convert a date string to a datetime object, and later convert the datetime object back to a string with a different format. I do the conversion using the Series .dt.strftime() accessor. When I call my function via a multiprocessing pool, rather than calling it directly, I see a big slowdown when converting from datetime to string. Below are timing measurements with and without multiprocessing; the code used to run these tests is at the bottom of the post.
No multiprocessing
Time to read: 0.286389112473
Time to convert to string: 0.160706996918
Total time: 0.4490878582 s
With multiprocessing
Time to read: 0.287585020065
Time to convert to string: 6.90422201157
Total time: 7.23801398277 s
This test shows that there is little difference in the time it takes to read in the DataFrame (converting string to date-time), but converting back out to a string takes much longer when calling from within multiprocessing. Is there a reason for this? Is there a way to work around it?
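A stripped-down variant that times only the strftime step inside the worker function can help confirm that the overhead isn't data transfer to the pool. This sketch uses synthetic timestamps rather than sample_data.csv, and the helper name time_strftime is mine, not from the full script below:

```python
import time
import multiprocessing

import pandas as pd


def time_strftime(n):
    # Time only the datetime -> string conversion on n synthetic rows.
    # Measuring inside the worker function excludes pool transfer overhead.
    s = pd.Series(pd.date_range('2016-01-01', periods=n, freq='S'))
    tic = time.time()
    s.dt.strftime('%Y-%m-%d')
    return time.time() - tic


direct = time_strftime(50000)

pool = multiprocessing.Pool(1)
pooled = pool.map(time_strftime, [50000])[0]
pool.close()
pool.join()

print('direct: {:.4f} s, via pool: {:.4f} s'.format(direct, pooled))
```

Because the stopwatch starts and stops inside the worker, any difference between the two numbers is time spent in the conversion itself, not in shipping data to or from the pool.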
Edit:
Just to clarify: in this example I'm intentionally giving the multiprocessing pool a single worker process. I'm interested in the difference in runtime between calling the test function directly and calling it via multiprocessing.
I certainly expect a small amount of overhead related to transferring data to the worker process, but I'm seeing added inefficiency when processing the DataFrame itself. That is, when I process the same data, with the same code, and the same resources, it takes 40x longer to convert the column from datetime objects to strings. I take the same hit again if I try to write the DataFrame to a file using to_csv().
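As an aside on possible workarounds: for this particular output format, the same string can be built without dt.strftime at all, via dt.date plus astype(str). This sidesteps strftime rather than explaining the slowdown, and it only works here because the target format happens to match the default date string. A minimal sketch on synthetic data:

```python
import pandas as pd

# Small synthetic series of timestamps to compare the two conversions.
s = pd.Series(pd.date_range('2016-01-01', periods=5, freq='D'))

via_strftime = s.dt.strftime('%Y-%m-%d')  # the approach from the script below
via_date = s.dt.date.astype(str)          # same 'YYYY-MM-DD' text, no strftime

print((via_strftime == via_date).all())   # prints True
```

Whether this avoids the multiprocessing penalty is exactly what I'd want to measure.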
My system configuration is the following:
- OS: Mac
- Python release: 2.7.11 |Anaconda 2.5.0
- Pandas version: 0.18.1
- Multiprocessing version: 0.70a1
Code to run timing test:
from __future__ import print_function
import time
import pandas as pd
import multiprocessing
# ------------------------------------------------------------------------------
def test_timing(file_name):
    date_format = '%a %b %d %H:%M:%S %Z %Y'
    date_parser = lambda x: pd.to_datetime(x, format=date_format)

    tic = time.time()
    df = pd.read_csv(file_name,
                     names=['date_time'],
                     parse_dates=['date_time'],
                     date_parser=date_parser,
                     nrows=50000)
    print('Time to read: {}'.format(time.time() - tic))

    tic = time.time()
    date_format = '%Y-%m-%d'
    df['date'] = df['date_time'].dt.strftime(date_format)
    print('Time to convert to string: {}'.format(time.time() - tic))

    return df
# ------------------------------------------------------------------------------
print('No multiprocessing')
tic = time.time()
df = test_timing('sample_data.csv')
# print(df.head())
print('Total time: {} s'.format(time.time() - tic))
print()
# ------------------------------------------------------------------------------
print('With multiprocessing')
tic = time.time()
pool = multiprocessing.Pool(1)
df = pool.map(test_timing, ['sample_data.csv'])
pool.close()
pool.join()
# print(df[0].head())
print('Total time: {} s'.format(time.time() - tic))