I am looking for ways to improve my memory-hogging code.
The last piece of the following section of code is an attempt to create separate columns from datetime columns. It does work; however, as I type this it is taking 1.2 GB of RAM while processing (in a separate Jupyter execution cell from the code that loads the data). The data set is large by my standards (about 4 million rows), so I don't expect the code to be cheap to execute, but during one execution I received a MemoryError.
I am certain this is not the most efficient way to achieve this, but I am not sure how to improve it. Would it be possible to create these values while the data set is being loaded (I sketch what I mean after the loading code below), and would that be less resource-intensive? Or is there a better way altogether to accomplish this?
I am also open to any suggestions for the top part of the code, which is probably not efficient either.
Code used to load the data set:
import os

import pandas as pd
from pandas.api.types import CategoricalDtype

files = os.listdir("./source_files/")
li = []
delimiter = ';'
shrtwkday = CategoricalDtype(categories=['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'], ordered=True)
use_cols = ['duration_sec', 'start_time', 'end_time', 'start_station_id', 'start_station_name',
            'start_station_latitude', 'start_station_longitude', 'end_station_id', 'end_station_name',
            'end_station_latitude', 'end_station_longitude', 'bike_id', 'user_type']
dtypes = {'duration_sec': 'int', 'start_time': 'str', 'end_time': 'str',
          'start_station_id': pd.Int64Dtype(), 'start_station_name': 'str',
          'start_station_latitude': 'float', 'start_station_longitude': 'float',
          'end_station_id': pd.Int64Dtype(), 'end_station_name': 'str',
          'end_station_latitude': 'float', 'end_station_longitude': 'float',
          'bike_id': 'int', 'user_type': 'str'}
parse_dates = ['start_time', 'end_time']

for file in files:
    with open("./source_files/" + file, 'rb') as f_in:
        df = pd.read_csv(f_in, compression='zip', delimiter=delimiter, usecols=use_cols, dtype=dtypes,
                         header=0, parse_dates=parse_dates, na_values=['', ' '])
    li.append(df)

trip_df = pd.concat(li, axis=0, ignore_index=True)
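This is roughly what I mean by creating the values at load time: derive the extra columns per file inside the loop, before the concat. I have not verified that it reduces peak memory (the concatenated frame still ends up holding every column), and the two columns shown are just an illustration:
for file in files:
    with open("./source_files/" + file, 'rb') as f_in:
        df = pd.read_csv(f_in, compression='zip', delimiter=delimiter, usecols=use_cols, dtype=dtypes,
                         header=0, parse_dates=parse_dates, na_values=['', ' '])
    # derive the extra columns on the smaller per-file frame
    df['start_date'] = df.start_time.dt.normalize()
    df['start_hour'] = df.start_time.dt.hour
    # ... same pattern for the remaining start_* / end_* columns ...
    li.append(df)
trip_df = pd.concat(li, axis=0, ignore_index=True)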
Code in question:
trip_df['start_date'] = trip_df.start_time.dt.strftime('%Y-%m-%d')
trip_df['start_year'] = trip_df.start_time.dt.strftime('%Y').astype(int)
trip_df['start_month'] = trip_df.start_time.dt.strftime('%m').astype(int)
trip_df['start_hour'] = trip_df.start_time.dt.strftime('%H').astype(int)
trip_df['start_weekday'] = trip_df.start_time.dt.strftime('%a').astype(shrtwkday)
trip_df['end_date'] = trip_df.end_time.dt.strftime('%Y-%m-%d')
trip_df['end_year'] = trip_df.end_time.dt.strftime('%Y').astype(int)
trip_df['end_month'] = trip_df.end_time.dt.strftime('%m').astype(int)
trip_df['end_hour'] = trip_df.end_time.dt.strftime('%H').astype(int)
trip_df['end_weekday'] = trip_df.end_time.dt.strftime('%a').astype(shrtwkday)
trip_df['start_date'] = pd.to_datetime(trip_df['start_date'], format = "%Y-%m-%d")
trip_df['end_date'] = pd.to_datetime(trip_df['end_date'], format = "%Y-%m-%d")
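One alternative I have been wondering about, which skips the string formatting and the astype(int) round trip, is to use the .dt accessors directly; I have not measured whether it is actually lighter on memory:
trip_df['start_date'] = trip_df.start_time.dt.normalize()  # datetime at midnight instead of a string
trip_df['start_year'] = trip_df.start_time.dt.year
trip_df['start_month'] = trip_df.start_time.dt.month
trip_df['start_hour'] = trip_df.start_time.dt.hour
trip_df['start_weekday'] = trip_df.start_time.dt.strftime('%a').astype(shrtwkday)
# ... and the same pattern for the end_* columns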
EDIT:
To resolve the memory issue, I chose to sample 10% of the data and perform the heavy processing only on that sample. Also, rather than saving the sample into a new DataFrame, I assigned it back to the same DataFrame variable, which released the other 90% from memory and helped as well.
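For reference, the sampling step looks roughly like this (the random_state is just an example value); assigning the result back to trip_df lets the original full DataFrame be garbage collected:
# keep a 10% sample in place of the full data set; reusing the name trip_df
# releases the original frame
trip_df = trip_df.sample(frac=0.1, random_state=42).reset_index(drop=True)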