I'm seeing odd behavior when working with timestamps in pandas DataFrames. The project I'm working on requires that I read in data, convert a date string to a datetime object, and later convert the datetime object back to a string with a different format. I do the conversion using the Series .dt.strftime() accessor. When I call my function via a multiprocessing pool, rather than calling it directly, I see a big slowdown when converting from datetime to string. Below are timing measurements with and without multiprocessing; the code used to run these tests is at the bottom of the post.
No multiprocessing
Time to read: 0.286389112473
Time to convert to string: 0.160706996918
Total time: 0.4490878582 s
With multiprocessing
Time to read: 0.287585020065
Time to convert to string: 6.90422201157
Total time: 7.23801398277 s
This test shows that there is little difference in the time it takes to read in the DataFrame (converting string to date-time), but converting back out to a string takes much longer when calling from within multiprocessing. Is there a reason for this? Is there a way to work around it?
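A stripped-down variant that times only the strftime step inside the worker function can help confirm that the overhead isn't data transfer to the pool. This sketch uses synthetic timestamps rather than sample_data.csv, and the helper name time_strftime is mine, not from the full script below:

```python
import time
import multiprocessing

import pandas as pd


def time_strftime(n):
    # Time only the datetime -> string conversion on n synthetic rows.
    # Measuring inside the worker function excludes pool transfer overhead.
    s = pd.Series(pd.date_range('2016-01-01', periods=n, freq='S'))
    tic = time.time()
    s.dt.strftime('%Y-%m-%d')
    return time.time() - tic


direct = time_strftime(50000)

pool = multiprocessing.Pool(1)
pooled = pool.map(time_strftime, [50000])[0]
pool.close()
pool.join()

print('direct: {:.4f} s, via pool: {:.4f} s'.format(direct, pooled))
```

Because the stopwatch starts and stops inside the worker, any difference between the two numbers is time spent in the conversion itself, not in shipping data to or from the pool.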
Edit:
Just to clarify: in this example I'm intentionally giving the multiprocessing pool a single worker process. I'm interested in the difference in runtime between calling the test function directly and calling it via multiprocessing.
I certainly expect a small amount of overhead related to transferring data to the worker process, but I'm seeing added inefficiency when processing the DataFrame itself. That is, when I process the same data, with the same code, and the same resources, it takes 40x longer to convert the column from datetime objects to strings. I take the same hit again if I try to write the DataFrame to a file using to_csv().
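As an aside on possible workarounds: for this particular output format, the same string can be built without dt.strftime at all, via dt.date plus astype(str). This sidesteps strftime rather than explaining the slowdown, and it only works here because the target format happens to match the default date string. A minimal sketch on synthetic data:

```python
import pandas as pd

# Small synthetic series of timestamps to compare the two conversions.
s = pd.Series(pd.date_range('2016-01-01', periods=5, freq='D'))

via_strftime = s.dt.strftime('%Y-%m-%d')  # the approach from the script below
via_date = s.dt.date.astype(str)          # same 'YYYY-MM-DD' text, no strftime

print((via_strftime == via_date).all())   # prints True
```

Whether this avoids the multiprocessing penalty is exactly what I'd want to measure.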
My system configuration is the following:
- OS: Mac
- Python release: 2.7.11 |Anaconda 2.5.0
- Pandas version: 0.18.1
- Multiprocessing version: 0.70a1
Code to run timing test:
from __future__ import print_function
import time
import pandas as pd
import multiprocessing
# ------------------------------------------------------------------------------
def test_timing(file_name):
    date_format = '%a %b %d %H:%M:%S %Z %Y'
    date_parser = lambda x: pd.to_datetime(x, format=date_format)

    tic = time.time()
    df = pd.read_csv(file_name,
                     names=['date_time'],
                     parse_dates=['date_time'],
                     date_parser=date_parser,
                     nrows=50000)
    print('Time to read: {}'.format(time.time() - tic))

    tic = time.time()
    date_format = '%Y-%m-%d'
    df['date'] = df['date_time'].dt.strftime(date_format)
    print('Time to convert to string: {}'.format(time.time() - tic))

    return df
# ------------------------------------------------------------------------------
print('No multiprocessing')
tic = time.time()
df = test_timing('sample_data.csv')
# print(df.head())
print('Total time: {} s'.format(time.time() - tic))
print()
# ------------------------------------------------------------------------------
print('With multiprocessing')
tic = time.time()
pool = multiprocessing.Pool(1)
df = pool.map(test_timing, ['sample_data.csv'])
pool.close()
pool.join()
# print(df[0].head())
print('Total time: {} s'.format(time.time() - tic))