I'm working with 8 different csv files. As a first step, I cleaned each file like this:

1) observations file

obs = pd.read_csv('....csv',sep = ";")
obs = obs.drop(['date', 'humidity', 'precipitation', 'station'],axis=1).dropna()
obs['time'] = obs['time'].astype('datetime64[ns]')
obs['time'] = obs['time'].apply(lambda x: x.strftime('%d-%m-%Y %H'))
obs.columns = ['temperature_obs','time'] 

2) forecast files

prev = pd.read_csv('....csv',sep = ";")
prev = prev.drop(['cloud_cover', 'date', 'humidity', 'latitude_r', 'longitude_r', 'pressure', 'wind', 'wind_dir'],axis=1).dropna()
prev['time'] = prev['time'].astype('datetime64[ns]')
prev['time'] = prev['time'].apply(lambda x: x.strftime('%d-%m-%Y %H'))
prev.columns = ['temperature_prev','time'] 


prev2 = pd.read_csv('....csv',sep = ";")
prev2 = prev2.drop(['cloud_cover', 'date', 'humidity', 'latitude_r', 'longitude_r', 'pressure', 'wind', 'wind_dir'],axis=1).dropna()
prev2['time'] = prev2['time'].astype('datetime64[ns]')
prev2['time'] = prev2['time'].apply(lambda x: x.strftime('%d-%m-%Y %H'))
prev2.columns = ['temperature_prev2','time'] 

...

and the same for the 5 other forecast files.

Then I merged all these files on the key "time" (in the format "day-month-year hour"), with a left join on the obs file, like this:

prevs = pd.merge(obs, prev[['time', 'temperature_prev']], how='left', on='time')

prevs = pd.merge(prevs, prev2[['time', 'temperature_prev2']], how='left', on='time')

... and so on

The final merged file has approximately 42 million rows. The process of producing it is very long, and now each time I try to run it, the algorithm / Python breaks.

I would like to know whether there are ways to optimize my code so that it runs faster and without breaking.

JEG
  • When you say "the algorithm /python breaks", do you mean it throws a `MemoryError`, or the computer freezes, or something else? – Leporello Jul 03 '19 at 12:52
  • The exact message printed in my Jupyter notebook is "the kernel crashed" – JEG Jul 03 '19 at 12:57

1 Answer

(1) Instead of loading the complete file, you can iterate over the rows of the csv and clean them separately.

This would basically go along the lines of this answer: Reading a huge .csv file
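
For example, a minimal sketch of chunked reading with pandas (the path 'prev1.csv', the chunk size and the column list below are placeholders based on the question's forecast files):

import pandas as pd

chunks = []
# read the forecast csv 100,000 rows at a time instead of loading it whole
for chunk in pd.read_csv('prev1.csv', sep=';', chunksize=100_000):
    # drop the unused columns and missing rows per chunk, as in the question
    chunk = chunk.drop(['cloud_cover', 'date', 'humidity', 'latitude_r',
                        'longitude_r', 'pressure', 'wind', 'wind_dir'], axis=1).dropna()
    chunk['time'] = pd.to_datetime(chunk['time']).dt.strftime('%d-%m-%Y %H')
    chunks.append(chunk)

prev = pd.concat(chunks, ignore_index=True)
prev.columns = ['temperature_prev', 'time']

Combined with point (3) below, each cleaned chunk could be appended to a file on disk instead of being concatenated in memory.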

(2) Instead of opening all of the files and keeping them in working memory, loop over a list of paths to the files and keep only the current one in memory.
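
A sketch of that idea, assuming a hypothetical clean_forecast() helper that applies the drop/dropna/time-formatting steps from the question; obs is the already-cleaned observations frame and the paths are placeholders:

import pandas as pd

forecast_paths = ['prev1.csv', 'prev2.csv']  # ... plus the other forecast files

merged = obs
for i, path in enumerate(forecast_paths, start=1):
    # load and clean one forecast file at a time
    prev = clean_forecast(pd.read_csv(path, sep=';'))
    prev.columns = ['temperature_prev%d' % i, 'time']
    merged = merged.merge(prev, how='left', on='time')
    del prev  # only the current forecast frame is ever kept in memory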

(3) Finally, for merging, have a look at PyTables. Rather than accumulating everything in a csv or a pd.DataFrame, create an HDF5 file (the format PyTables works with) and add the cleaned results to it iteratively.
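
A minimal sketch of that approach, assuming a hypothetical output file 'previsions.h5' (pandas' HDFStore is backed by PyTables, and its table format supports incremental appends, so the full result never has to sit in memory at once):

import pandas as pd

with pd.HDFStore('previsions.h5', mode='w') as store:
    for chunk in pd.read_csv('prev1.csv', sep=';', chunksize=100_000):
        # ... apply the cleaning steps from the question to `chunk` here ...
        store.append('prev1', chunk, data_columns=['time'])

The table can later be read back in pieces with store.select('prev1', ...) instead of loading everything at once.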

Nikolas Rieble