I tried to open a csv.gz file with Dask in Python. I will explain my code step by step.

First, I open the file with dask.dataframe.read_csv. At this step I specify the dtypes and combine 'Date[G]' and 'Time[G]' into a single column.

import dask.dataframe as dd

dtype_dict = {'#RIC': 'str', 'Price': 'float', 'Volume': 'float'}
df = dd.read_csv(f, compression='gzip', header=0, sep=',',
                 quotechar='"', usecols=['#RIC', 'Date[G]', 'Time[G]', 'Price', 'Volume'],
                 blocksize=None, parse_dates=[['Date[G]', 'Time[G]']], dtype=dtype_dict)

After that, I drop all rows with NA in the 'Price' and 'Volume' columns and set the combined column 'Date[G]_Time[G]' as the index, without dropping the column, since I still need it later.

df = df.dropna(subset=['Price', 'Volume'])
df = df.set_index('Date[G]_Time[G]', drop=False)

Then I split the 'Date[G]_Time[G]' column again, since my output file needs the date and time in two separate columns. I know there must be a better way to handle this; I just cannot find it.

df['Date[G]'] = dd.to_datetime(df['Date[G]_Time[G]']).dt.date
df['Time[G]'] = dd.to_datetime(df['Date[G]_Time[G]']).dt.time
df = df.drop(['Date[G]_Time[G]'], axis=1)

After that, I append the data frame to a list. I have a bunch of csv.gz files, and I want to open all of them and then repartition the resulting big data frame by calendar year.

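dl = []
dl.append(df)  # repeated for each csv.gz file; list.append returns None
df_concated = dd.concat(dl)  # combine the per-file frames into one
df_concated = df_concated.repartition(freq='A')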

I know Dask can be really slow with its default settings, but I don't know how to configure it, which is really frustrating. Does anyone know how to optimize my code?

Sample data.

 #RIC   Date[G]   Time[G]         Price Volume
 VZC.L 2014-05-01 06:16:00.480000 46.64 88.0
 VZC.L 2014-05-01 06:16:00.800000 46.64 33.0
 VZC.L 2014-05-01 06:16:00.890000 46.69 20.0
 VZC.L 2014-05-01 06:16:00.980000 46.69 40.0
 VZC.L 2014-05-01 06:16:01.330000 46.67 148.0
Chuan Wang

1 Answer

The problem may be the parse_dates parameter of read_csv:

parse_dates=[['Date[G]','Time[G]']]

Try loading the file without parse_dates, so that the date and time fields are read as object type, and then convert them to datetime afterwards.
See this answer.

skibee