
I have a large CSV file which I need to manipulate with Pandas. Reading the documentation, I saw that choosing engine='c' could help speed up the process. The documentation notes that to use this feature, you must declare the data types. So here is my first attempt:

import numpy as np
import pandas as pd
from datetime import datetime

data_types = {
    'AccountNumber':       object,
    'AccountName':         object,
    'StatsDate':           datetime,
    'InvoiceDate':         datetime,
    'Product':             object,
    'LineItem':            object,
    'Charges':             np.float64
}
df = pd.read_csv('accounting_worksheet.csv',
    engine='c',
    dtype=data_types,
    encoding='utf-8')

However, Pandas did not recognize the items declared as datetime:

In [10]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3669868 entries, 0 to 3669867
Data columns (total 7 columns):
AccountNumber         object
AccountName           object
StatsDate             object
InvoiceDate           object
Product               object
LineItem              object
Charges               float64
dtypes: float64(1), object(6)

Digging around on Stack Overflow, I found two posts that seemed helpful and tried this instead:

from datetime import datetime

def parse_dates(x):
    return datetime.strptime(x, '%m/%d/%Y')

data_types = {
    'AccountNumber':       object,
    'AccountName':         object,
    'StatsDate':           datetime,
    'InvoiceDate':         datetime,
    'Product':             object,
    'LineItem':            object,
    'Charges':             np.float64
}
df = pd.read_csv('accounting_worksheet.csv',
    engine='c',
    dtype=data_types,
    parse_dates=['StatsDate','InvoiceDate'],
    date_parser=parse_dates,
    encoding='utf-8')

This time I got the correct dtypes, but it took a few minutes to import the file:

In [14]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3669868 entries, 0 to 3669867
Data columns (total 7 columns):
AccountNumber         object
AccountName           object
StatsDate             datetime64[ns]
InvoiceDate           datetime64[ns]
Product               object
LineItem              object
Charges               float64
dtypes: datetime64[ns](2), float64(1), object(4)

For now, I've resorted to the first method outlined above and then, after importing, run this:

df['StatsDate'] = pd.to_datetime(df['StatsDate'], format='%m/%d/%Y')
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%m/%d/%Y')
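For context, the reason this is reasonably fast is that `pd.to_datetime` with an explicit `format` parses the whole column in a vectorized pass, rather than calling a Python function once per row the way `date_parser` does. A minimal self-contained sketch of the pattern, using a tiny synthetic stand-in for my file (the data here is made up; column names match the ones above):

```python
import io

import numpy as np
import pandas as pd

# Tiny synthetic stand-in for accounting_worksheet.csv (hypothetical data).
csv_data = io.StringIO(
    "AccountNumber,StatsDate,InvoiceDate,Charges\n"
    "A001,01/15/2016,02/01/2016,19.99\n"
    "A002,01/16/2016,02/01/2016,5.00\n"
)

# Read everything with plain dtypes first (dates come in as object)...
df = pd.read_csv(csv_data,
                 engine='c',
                 dtype={'AccountNumber': object, 'Charges': np.float64})

# ...then convert the date columns in one vectorized pass each.
for col in ('StatsDate', 'InvoiceDate'):
    df[col] = pd.to_datetime(df[col], format='%m/%d/%Y')

print(df.dtypes)
```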

This file will continue to grow in size, so I'm looking for the most optimized way of doing this. Is there a more efficient solution?

measureallthethings
  • how does it perform if you remove the `datetime` dtype mapping and do this: `df = pd.read_csv('accounting_worksheet.csv', engine='c', dtype=data_types, parse_dates=['StatsDate','InvoiceDate'], encoding='utf-8')`? it should be able to sniff the format and as `dayfirst=False` by default it should handle your format – EdChum Feb 25 '16 at 16:53
  • 2
    `infer_datetime_format=True` may help as well. There's some overhead, but if the file is large enough that overhead is more than made up for. – TomAugspurger Feb 25 '16 at 16:54
  • @EdChum...i tried what you suggested and it's been 10 minutes and still running :-/ @TomAugspurger tried `infer` and got `CPU times: user 28.1 s, sys: 782 ms, total: 28.9 s Wall time: 30.8 s` which is definite improvement over option #2 i showed in my question – measureallthethings Feb 25 '16 at 17:13
  • So you got like a 20x speedup from infer? Nice! That's potentially going to become the default. – TomAugspurger Feb 25 '16 at 17:19
  • yes, a definite improvement. however, i should have been more precise...option #2 i listed was `CPU times: user 1min 18s, sys: 1.25 s, total: 1min 19s Wall time: 1min 21s`. your suggestion is an improvement so it will be my default, thanks! – measureallthethings Feb 25 '16 at 17:22
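Spelling out EdChum's comment as a complete snippet (a sketch against the same tiny synthetic file; note the date columns are left out of the `dtype` dict so `parse_dates` can handle them):

```python
import io

import numpy as np
import pandas as pd

# Tiny synthetic stand-in for accounting_worksheet.csv (hypothetical data).
csv_data = io.StringIO(
    "AccountNumber,StatsDate,InvoiceDate,Charges\n"
    "A001,01/15/2016,02/01/2016,19.99\n"
    "A002,01/16/2016,02/01/2016,5.00\n"
)

# No datetime entries in dtype -- parse_dates takes care of those columns,
# and with dayfirst=False (the default) MM/DD/YYYY is read correctly.
df = pd.read_csv(
    csv_data,
    engine='c',
    dtype={'AccountNumber': object, 'Charges': np.float64},
    parse_dates=['StatsDate', 'InvoiceDate'],
)

print(df.dtypes)
```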

0 Answers