I tried to open a csv.gz file with Dask in Python. I will explain my code step by step.
First, I open the file with dask.dataframe.read_csv. At this step I specify the dtypes and combine 'Date[G]' and 'Time[G]' into a single column.
import dask.dataframe as dd

dtype_dict = {'#RIC': 'str', 'Price': 'float', 'Volume': 'float'}
df = dd.read_csv(f, compression='gzip', header=0, sep=',', quotechar='"',
                 usecols=['#RIC', 'Date[G]', 'Time[G]', 'Price', 'Volume'],
                 blocksize=None, parse_dates=[['Date[G]', 'Time[G]']],
                 dtype=dtype_dict)
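As far as I understand, blocksize=None is required here because gzip is not a splittable compression format, so Dask has to read each file as a single partition. The parse_dates=[['Date[G]', 'Time[G]']] option merges the two columns into one datetime column named 'Date[G]_Time[G]'.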
After that I drop all NA values in the 'Price' and 'Volume' columns and set the combined column 'Date[G]_Time[G]' as the index, without dropping the column since I still need it later.
df = df.dropna(subset=['Price', 'Volume'])
df = df.set_index('Date[G]_Time[G]', drop=False)
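I read that set_index is usually the most expensive step in Dask because it sorts and shuffles the data. Since tick data like this is normally already ordered by timestamp, I believe passing sorted=True should skip the shuffle; a sketch, assuming each file really is pre-sorted:

df = df.set_index('Date[G]_Time[G]', drop=False, sorted=True)  # assumption: rows are already time-ordered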
Then I split that 'Date[G]_Time[G]' column again, since my output file needs date and time in two separate columns. I know there must be a better way to handle this; I just cannot find it.
# 'Date[G]_Time[G]' is already a datetime column thanks to parse_dates,
# so the .dt accessor works on it directly
df['Date[G]'] = df['Date[G]_Time[G]'].dt.date
df['Time[G]'] = df['Date[G]_Time[G]'].dt.time
df = df.drop(['Date[G]_Time[G]'], axis=1)
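The closest thing to a "better way" I could come up with is to derive both columns from the index in one pass over each partition with map_partitions, so the datetime is only touched once. This is just a sketch, assuming the index is the parsed datetime; split_datetime is my own helper name:

def split_datetime(pdf):
    # pdf is a plain pandas partition; its index is the parsed datetime
    pdf = pdf.copy()
    pdf['Date[G]'] = pdf.index.date
    pdf['Time[G]'] = pdf.index.time
    return pdf

df = df.map_partitions(split_datetime)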
After that, I append the DataFrame to a list. I have a bunch of csv.gz files, and I want to open all of them, concatenate the resulting DataFrames, and then repartition the big DataFrame by calendar year.
dl = []
dl.append(df)  # this happens once per file, inside the loop
df_concated = dd.concat(dl)
df_concated = df_concated.repartition(freq='A')
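Since dd.read_csv accepts a glob pattern, I suspect the whole list-and-concat step could be avoided by reading every file in one call; a sketch, assuming the files live under a hypothetical data/ directory:

df_all = dd.read_csv('data/*.csv.gz', compression='gzip', header=0, sep=',', quotechar='"',
                     usecols=['#RIC', 'Date[G]', 'Time[G]', 'Price', 'Volume'],
                     blocksize=None, parse_dates=[['Date[G]', 'Time[G]']],
                     dtype=dtype_dict)

But I am not sure whether this is any faster than concatenating per-file DataFrames.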
I know that with the default settings Dask can be really slow, and I just don't know how to configure it, which makes me really upset. Does anyone know how to optimize my code?
Sample data:
#RIC Date[G] Time[G] Price Volume
VZC.L 2014-05-01 06:16:00.480000 46.64 88.0
VZC.L 2014-05-01 06:16:00.800000 46.64 33.0
VZC.L 2014-05-01 06:16:00.890000 46.69 20.0
VZC.L 2014-05-01 06:16:00.980000 46.69 40.0
VZC.L 2014-05-01 06:16:01.330000 46.67 148.0