
Since functions are first-class citizens in Python, I should be able to refactor this:

def get_events():
    csv_path = os.path.join(INPUT_CSV_PATH, DATA_NAME + '.csv')
    print(f'Starting reading events at {datetime.now()}')
    start_time = datetime.now()
    events = pd.read_csv(csv_path, dtype=DTYPES)
    end_time = datetime.now()
    print(f'Finished reading events at {end_time} ({end_time - start_time})')
    return events

To something like this:

def get_events():
    csv_path = os.path.join(INPUT_CSV_PATH, DATA_NAME + '.csv')
    events = _time_function_call('reading events', pd.read_csv, {'filepath_or_buffer': csv_path, 'dtype': DTYPES})
    return events

def _time_function_call(message, func, *kwargs):
    print(f'Starting {message} at {datetime.now()}')
    start_time = datetime.now()
    result = func(*kwargs)
    end_time = datetime.now()
    print(f'Finished {message} at {end_time} ({end_time - start_time})')
    return result

I.e. pass the pandas read_csv function and its named arguments into a helper function. (N.B. I wasn't sure how to pass in the named arguments when passing a function around; this answer helped.)

But after refactoring I get the following error:

ValueError: Invalid file path or buffer object type: <class 'dict'>

What am I missing about how to pass functions and their named parameters into another Python function for evaluation?

dumbledad

1 Answer


You probably want to refactor into:

def get_events():
    csv_path = os.path.join(INPUT_CSV_PATH, DATA_NAME + '.csv')
    events = _time_function_call('reading events', pd.read_csv, filepath_or_buffer=csv_path, dtype=DTYPES)
    return events

def _time_function_call(message, func, *args, **kwds):
    start_time = datetime.now()
    print(f'Starting {message} at {start_time}')
    result = func(*args, **kwds)
    end_time = datetime.now()
    duration = end_time - start_time
    print(f'Finished {message} at {end_time} ({duration})')
    return result

that way Python can take care of handling arbitrary argument lists.
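To illustrate what the star syntax does (a minimal sketch, separate from the pandas code above, using a made-up `greet` function): `*args` collects positional arguments into a tuple and `**kwds` collects keyword arguments into a dict, and unpacking them again in the inner call reproduces the original call shape exactly.

```python
def wrapper(func, *args, **kwds):
    # args arrives here as a tuple, kwds as a dict; the stars in the
    # call below unpack them back into positional and keyword arguments
    return func(*args, **kwds)

def greet(greeting, name='world'):
    return f'{greeting}, {name}!'

print(wrapper(greet, 'hello', name='Python'))  # hello, Python!
```

The original error happened because a single `*kwargs` parameter collected the dict as one positional argument, so `func` received the dict itself rather than its contents as keyword arguments.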

I'd suggest using context managers and the logging module, as it's much easier to get code like this to compose nicely, e.g.:

from time import perf_counter
import logging

logger = logging.getLogger(__name__)

class log_timer:
    def __init__(self, message):
        self.message = message

    def __enter__(self):
        logger.info(f"{self.message} started")
        # call to perf_counter() should be the last statement in method
        self.start_time = perf_counter()

    def __exit__(self, exc_type, exc_value, traceback):
        # perf_counter() call should be first statement
        secs = perf_counter() - self.start_time
        state = 'finished' if exc_value is None else 'failed'
        logger.info(f"{self.message} {state} after {secs * 1000:.2f}ms")

which can be used like:

from time import sleep

logging.basicConfig(
    format='%(asctime)s %(levelname)s %(message)s',
    level=0,
)

with log_timer("sleep"):
    sleep(1)

that way you don't have to worry about making arbitrary bits of code into functions and threading state between them.

also, using datetime as you were isn't great for measuring the runtime of small bits of code; the time module provides perf_counter, which hands off to a more appropriate (higher-resolution) OS/CPU timer.
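A quick side-by-side sketch of the two timers (the `sleep(0.05)` is just a stand-in for whatever work is being timed):

```python
from datetime import datetime
from time import perf_counter, sleep

start_dt = datetime.now()
start_pc = perf_counter()
sleep(0.05)  # placeholder for the real work
elapsed_dt = (datetime.now() - start_dt).total_seconds()
elapsed_pc = perf_counter() - start_pc

# perf_counter is monotonic and high-resolution, so it can't go
# backwards if the system clock is adjusted mid-measurement;
# datetime.now reads the wall clock and can
print(f'datetime: {elapsed_dt:.4f}s, perf_counter: {elapsed_pc:.4f}s')
```

For multi-minute runs the two will agree to well within a second, which is why the low-resolution timer was fine in the original code.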

Sam Mason
  • Thanks, that worked a treat. I could even drop the annoying `filepath_or_buffer=` from the call (no one ever names that parameter). For the timing stuff I'll look through your suggestions, that's useful. Unfortunately it may just be short lines of code but they take more than five minutes to run so a low-resolution timer is OK. – dumbledad Jun 24 '19 at 12:42
  • if it's taking that long, I'd look into splitting the file up into several smaller files outside python, otherwise you could use something like [`chunksize`](http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking) or even use something like [`dask`](https://docs.dask.org/en/latest/dataframe.html) – Sam Mason Jun 24 '19 at 12:55
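For reference, a minimal sketch of the `chunksize` approach from the comment above (using an in-memory buffer in place of the real CSV path, and a made-up `value` column):

```python
import io
import pandas as pd

# stand-in for the real events CSV on disk
csv_data = io.StringIO('id,value\n1,10\n2,20\n3,30\n4,40\n')

total = 0
# with chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file into memory at once
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += int(chunk['value'].sum())

print(total)  # 100
```

Each chunk is an ordinary DataFrame, so per-chunk work (filtering, aggregating, appending to a store) composes the same way as whole-file code.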