I have a large CSV file which I need to manipulate with Pandas. Reading the documentation, I see that engine='c' can speed up parsing, and that declaring the column data types up front helps it along. So here is my first attempt:
import numpy as np
import pandas as pd
from datetime import datetime

# attempt to declare the date columns directly as datetime via dtype
data_types = {
    'AccountNumber': object,
    'AccountName': object,
    'StatsDate': datetime,
    'InvoiceDate': datetime,
    'Product': object,
    'LineItem': object,
    'Charges': np.float64,
}

df = pd.read_csv('accounting_worksheet.csv',
                 engine='c',
                 dtype=data_types,
                 encoding='utf-8')
However, Pandas did not recognize the columns declared as datetime. As far as I can tell, read_csv doesn't perform datetime conversion through the dtype argument, so both columns came back as plain object:
In [10]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3669868 entries, 0 to 3669867
Data columns (total 7 columns):
AccountNumber object
AccountName object
StatsDate object
InvoiceDate object
Product object
LineItem object
Charges float64
dtypes: float64(1), object(6)
Digging around on Stack Overflow, I found two posts that seemed helpful and tried this instead:
from datetime import datetime

def parse_dates(x):
    # parse one date string in the file's MM/DD/YYYY format
    return datetime.strptime(x, '%m/%d/%Y')

data_types = {
    'AccountNumber': object,
    'AccountName': object,
    'StatsDate': datetime,
    'InvoiceDate': datetime,
    'Product': object,
    'LineItem': object,
    'Charges': np.float64,
}

df = pd.read_csv('accounting_worksheet.csv',
                 engine='c',
                 dtype=data_types,
                 parse_dates=['StatsDate', 'InvoiceDate'],
                 date_parser=parse_dates,
                 encoding='utf-8')
This time, I get the correct dtypes, but it took a few minutes to import the file:
In [14]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3669868 entries, 0 to 3669867
Data columns (total 7 columns):
AccountNumber object
AccountName object
StatsDate datetime64[ns]
InvoiceDate datetime64[ns]
Product object
LineItem object
Charges float64
dtypes: datetime64[ns](2), float64(1), object(4)
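My working theory (an assumption on my part, the docs don't spell it out) is that a callable date_parser ends up being applied one value at a time, so the import pays for millions of Python-level strptime calls. A quick way to sanity-check that in IPython, using a synthetic column of roughly the right shape (the values here are made up):

import pandas as pd
from datetime import datetime

# hypothetical stand-in for one of the date columns
sample = pd.Series(['03/15/2014', '04/01/2014'] * 500000)

# per-row Python parsing: effectively what a callable date_parser does
%timeit sample.map(lambda s: datetime.strptime(s, '%m/%d/%Y'))

# vectorized parsing with an explicit format
%timeit pd.to_datetime(sample, format='%m/%d/%Y')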
For now, I've resorted to the first method outlined above, then running this after the import:
df['StatsDate'] = pd.to_datetime(df['StatsDate'], format='%m/%d/%Y')
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%m/%d/%Y')
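A variation I've sketched but haven't benchmarked at full scale: the file has millions of rows but presumably far fewer distinct dates, so each unique string could be parsed once and the results mapped back onto the column. A minimal sketch, assuming df comes from the first method and the date strings are well formed with no missing values:

for col in ['StatsDate', 'InvoiceDate']:
    uniques = df[col].unique()
    # parse each distinct date string once, then broadcast back with map
    lookup = pd.Series(pd.to_datetime(uniques, format='%m/%d/%Y'),
                       index=uniques)
    df[col] = df[col].map(lookup)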
This file will continue to grow, so I'm looking for the most scalable way to handle the import. Is there a more efficient solution?