I have a large number of time series (millions) of varying length that I plan to do a clustering analysis on (probably using the sklearn implementation of kmeans).
For my purposes I need to align the time series (so that the maximum value is centered), pad them with zeros (so they are all the same length), and normalize them before I can do the clustering analysis. So, as a trivial example, something like:
[5, 0, 7, 10, 6]
Would become something like
[0, 0.5, 0, 0.7, 1, 0.6, 0, 0, 0]
In the real data, the raw time series are of length 90, and the padded/aligned/normed time series are of length 181. Of course, we have lots of zeros here, so a sparse matrix seems the ideal way of storing the data.
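For concreteness, the preprocessing on each series looks roughly like this (a minimal sketch; center_pad_normalize is just an illustrative name, the default padded length of 2*len - 1 matches the toy example above, and for the real data I pass 181 explicitly):

import numpy as np

def center_pad_normalize(series, padded_len=None):
    # Scale by the maximum, then drop the series into a zero-padded
    # array so that its peak lands at the centre index.
    x = np.asarray(series, dtype=float)
    if padded_len is None:
        padded_len = 2 * len(x) - 1
    out = np.zeros(padded_len)
    offset = padded_len // 2 - int(np.argmax(x))
    out[offset:offset + len(x)] = x / x.max()
    return out

center_pad_normalize([5, 0, 7, 10, 6])
# -> [0, 0.5, 0, 0.7, 1, 0.6, 0, 0, 0]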
Based on this, I have two related questions:
1 - How best to store these in memory? My current, inefficient approach is to compute the dense normed/aligned/padded row for each time series, write it to a simple text file for storage, and then separately read that data back into a scipy sparse lil_matrix:
import scipy.sparse

rows, columns = N, 181
matrix = scipy.sparse.lil_matrix((rows, columns))
for i, line in enumerate(open(file_containing_dense_matrix_data)):
    # The first two values in each line are metadata
    values = [float(v) for v in line.strip().split(',')[2:]]
    matrix[i, :] = values
This is both slow and more memory intensive than I had hoped. Is there a preferred method?
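For comparison, one alternative I have considered is skipping the dense text intermediate entirely: collect (row, column, value) triples while preprocessing, build a coo_matrix in one shot, and convert it to CSR. I don't know whether this is actually the preferred idiom, which is part of what I'm asking (all_series is a stand-in for my collection of raw series, and center_pad_normalize is the helper sketched above):

import numpy as np
import scipy.sparse

row_idx, col_idx, vals = [], [], []
for i, series in enumerate(all_series):
    padded = center_pad_normalize(series, padded_len=181)
    nz = np.nonzero(padded)[0]            # columns of the nonzero entries
    row_idx.extend([i] * len(nz))
    col_idx.extend(nz)
    vals.extend(padded[nz])

matrix = scipy.sparse.coo_matrix(
    (vals, (row_idx, col_idx)), shape=(len(all_series), 181)
).tocsr()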
2 - Is there a better way to store the time series on disk? I have yet to find an efficient means to write the data to disk directly as a sparse matrix that I can read (relatively) quickly into memory at a later time.
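To illustrate what I mean by "read (relatively) quickly into memory": ideally the round trip would be as simple as something like the following (assuming scipy.sparse.save_npz / load_npz are appropriate here, which I'm not sure about, or some equivalently compact binary format):

import scipy.sparse

# Write the CSR matrix to a compressed binary file...
scipy.sparse.save_npz('timeseries_sparse.npz', matrix)

# ...and load it back later for the clustering step.
matrix = scipy.sparse.load_npz('timeseries_sparse.npz')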
My ideal answer here is a method that addresses both questions, i.e. a way to store the dense matrix rows directly in a sparse data structure, and to read/write the data to/from disk efficiently.