
I am looking for ways to improve my memory-hogging code.

The last piece of the following section of code is an attempt to create separate columns from datetime columns. It works; however, as I type this it is using 1.2 GB of RAM while processing (in a Jupyter cell separate from the one that loads the data). The data set is large by my standards (about 4 million rows), so I don't expect it to be cheap to execute, but during one execution I received a MemoryError.

I am certain this is not the most efficient way to achieve this, but I am not sure how to improve it. Is it possible to create these values at the time the data set is loaded, and would that be less resource-intensive? Or is there a better way altogether to accomplish this?

I am also open to suggestions for improving the top part of the code; it is probably not efficient either.

Code used to load the data set:

import os

import pandas as pd
from pandas.api.types import CategoricalDtype

files = os.listdir("./source_files/")

li = []

delimiter = ';'

# Ordered categorical dtype for the abbreviated weekday names
shrtwkday = CategoricalDtype(categories=['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'], ordered=True)

use_cols = ['duration_sec', 'start_time', 'end_time', 'start_station_id', 'start_station_name',
            'start_station_latitude', 'start_station_longitude', 'end_station_id', 'end_station_name',
            'end_station_latitude', 'end_station_longitude', 'bike_id', 'user_type']

# Nullable Int64 for the station ids so missing values don't force them to float
dtypes = {'duration_sec': 'int', 'start_time': 'str', 'end_time': 'str',
          'start_station_id': pd.Int64Dtype(), 'start_station_name': 'str',
          'start_station_latitude': 'float', 'start_station_longitude': 'float',
          'end_station_id': pd.Int64Dtype(), 'end_station_name': 'str',
          'end_station_latitude': 'float', 'end_station_longitude': 'float',
          'bike_id': 'int', 'user_type': 'str'}

parse_dates = ['start_time', 'end_time']

# Read each zipped CSV into its own frame, then concatenate once at the end
for file in files:
    with open("./source_files/" + file, 'rb') as f_in:
        df = pd.read_csv(f_in, compression='zip', delimiter=delimiter, usecols=use_cols, dtype=dtypes,
                         header=0, parse_dates=parse_dates, na_values=['', ' '])
        li.append(df)

trip_df = pd.concat(li, axis=0, ignore_index=True)

Code in question:

trip_df['start_date'] = trip_df.start_time.dt.strftime('%Y-%m-%d')
trip_df['start_year'] = trip_df.start_time.dt.strftime('%Y').astype(int)
trip_df['start_month'] = trip_df.start_time.dt.strftime('%m').astype(int)
trip_df['start_hour'] = trip_df.start_time.dt.strftime('%H').astype(int)
trip_df['start_weekday'] = trip_df.start_time.dt.strftime('%a').astype(shrtwkday)

trip_df['end_date'] = trip_df.end_time.dt.strftime('%Y-%m-%d')
trip_df['end_year'] = trip_df.end_time.dt.strftime('%Y').astype(int)
trip_df['end_month'] = trip_df.end_time.dt.strftime('%m').astype(int)
trip_df['end_hour'] = trip_df.end_time.dt.strftime('%H').astype(int)
trip_df['end_weekday'] = trip_df.end_time.dt.strftime('%a').astype(shrtwkday)

trip_df['start_date'] = pd.to_datetime(trip_df['start_date'], format="%Y-%m-%d")
trip_df['end_date'] = pd.to_datetime(trip_df['end_date'], format="%Y-%m-%d")
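
For comparison, the same columns can be derived directly from the parsed datetime64 data with the pandas .dt accessors, which skips the intermediate strftime strings for everything except the weekday abbreviation. This is only a rough sketch, assuming the trip_df and shrtwkday defined above:

# Sketch: read the components straight off the datetime64 columns instead of
# formatting to strings and parsing back.
for col in ['start', 'end']:
    ts = trip_df[f'{col}_time']
    trip_df[f'{col}_date'] = ts.dt.normalize()   # datetime truncated to midnight
    trip_df[f'{col}_year'] = ts.dt.year
    trip_df[f'{col}_month'] = ts.dt.month
    trip_df[f'{col}_hour'] = ts.dt.hour
    # Only the weekday abbreviation still goes through strftime so it matches
    # the 'Sun'..'Sat' categories of shrtwkday.
    trip_df[f'{col}_weekday'] = ts.dt.strftime('%a').astype(shrtwkday)

Object (string) Series tend to be several times larger than the numeric columns they encode, so avoiding the string round trip should lower peak memory use, though that has not been measured on this data set.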

EDIT:

To resolve the memory issue, I chose to sample 10% of the data and perform the heavy processing only on that sample. Also, rather than saving the sample into a new DataFrame, I assigned it back to the same DataFrame, which released the other 90% from memory and helped as well.

J. Brown
  • I'm not really following what you're trying to accomplish, overall. But you seem to be loading a lot of files and appending them into a list. Do you really need to have all those rows of data in memory at once? – user2740650 Jan 18 '20 at 23:38
  • one issue is that you are also holding every single data set in memory, because once you finish processing you append that to a list, so if the operation is expensive, it will be `n` times expensive where `n == len(files)` – gold_cy Jan 18 '20 at 23:38
  • After loading I am taking a random sample, but since the dataset is split across the multiple files, it seems the only way to be truly random is to load the entire dataset and then sample it. Your comment did make it obvious that since I am only going to be working with a sample, I could do the datetime split on the sample... that will resolve my memory issue on the second set of code. I am still curious about whether this method (which I did research) is the best way to perform the column creation piece. – J. Brown Jan 18 '20 at 23:43
  • Oh, then I guess the data you're sampling is a tiny subset of the overall data. Let's say you sample 10 out of a million rows. Instead of loading all million rows into memory and then selecting 10, instead build up a list of 10 rows by reading and discarding all but one. Eg. read a file, record one random row, and throw away the file. – user2740650 Jan 19 '20 at 00:01
  • @user2740650 This would not be truly a random sample of the total data set. It would be a collection of random samples from the sub-sets, but not a random sample of the total data set (for example if one file has significantly more data than another file, it should not be sampled at the same sample rate as others). I have resolved my memory issue. Fortunately, I have the memory to hold the data set temporarily and I moved the date split process to after the sample. I am concluding that the second part of the code is as good as it gets. – J. Brown Jan 19 '20 at 00:12
  • OK. Yes, I was oversimplifying just to illustrate the idea. In reality you'd have to adjust based on the number of lines in a given file. Or (simpler) choose the indices of the rows you're going to sample before you even start reading the first file. That assumes you have some idea of the overall number of rows. – user2740650 Jan 19 '20 at 00:24
  • I had the same thought about pre-selecting the indices and also the same thought regarding the fact that i would not know the number of rows in advance. If memory were an issue vs processing time, I could loop through to count the indices (without storing), do the random selection from that info, then loop again selecting only the sample rows. I have solutions to my immediate problems. If one of you want to provide your comments as an answer I will give you credit for the assist. – J. Brown Jan 19 '20 at 01:53
  • I'm too lazy :-) . Glad you're on your way. – user2740650 Jan 19 '20 at 02:11
  • Can you share the entire program, at least? – AMC Jan 19 '20 at 02:45
  • @J.Brown: if someone has given the answer in comments and chooses not to write it up, you are most welcome to self-answer it using their hints. You may even get upvote points for it, and that is fine too. – halfer Jan 20 '20 at 22:03
  • For choosing random rows without knowing the row count in advance, you could probably adapt https://stackoverflow.com/a/232248/5987. – Mark Ransom Jan 20 '20 at 23:24
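
A rough sketch of the reservoir-sampling idea behind that link, assuming each zip under ./source_files/ holds a single ';'-delimited CSV with a header row (reservoir_sample_rows is a hypothetical helper, not code from the question):

import csv
import os
import random
import zipfile

def reservoir_sample_rows(folder, k, seed=None):
    """Keep k uniformly random data rows from every CSV under `folder`
    without knowing the total row count up front (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    for name in os.listdir(folder):
        with zipfile.ZipFile(os.path.join(folder, name)) as zf:
            # Assumes each archive contains exactly one CSV member.
            with zf.open(zf.namelist()[0]) as raw:
                lines = (line.decode('utf-8') for line in raw)
                reader = csv.reader(lines, delimiter=';')
                next(reader)                    # skip this file's header row
                for row in reader:
                    seen += 1
                    if len(reservoir) < k:
                        reservoir.append(row)
                    else:
                        j = rng.randrange(seen)  # row survives with probability k/seen
                        if j < k:
                            reservoir[j] = row
    return reservoir

Every row ends up in the sample with the same probability k / total_rows, regardless of how unevenly the rows are spread across the files, and the resulting list could then be handed to pd.DataFrame for the datetime split.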

1 Answer


Thanks to the comments, I arrived at a solution that resolved my memory issues.

To resolve the memory issue, I chose to sample 10% of the data and perform the heavy processing only on that sample. Also, rather than saving the sample into a new DataFrame, I assigned it back to the same DataFrame, which released the other 90% from memory and helped as well.

J. Brown
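
A rough sketch of that approach, reusing the trip_df built in the question (the random_state value is arbitrary and only there for reproducibility):

# Overwriting trip_df with the 10% sample drops the only reference to the full
# frame, so the remaining ~90% of the rows can be garbage collected.
trip_df = trip_df.sample(frac=0.1, random_state=42).reset_index(drop=True)

# The expensive datetime split then runs on the sample only, e.g.:
trip_df['start_date'] = trip_df.start_time.dt.strftime('%Y-%m-%d')
trip_df['start_year'] = trip_df.start_time.dt.strftime('%Y').astype(int)
# ...and so on for the remaining derived columns.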