
I have a large number of time series (millions) of varying length that I plan to do a clustering analysis on (probably using the sklearn implementation of kmeans).

For my purposes I need to align the time series (so that the maximum value is centered), pad them with zeros (so they are all the same length), and normalize them before I can do the clustering analysis. As a trivial example, something like:

[5, 0, 7, 10, 6]

Would become something like

[0, 0.5, 0, 0.7, 1, 0.6, 0, 0, 0]

In the real data, the raw time series are of length 90, and the padded/aligned/normed time series are of length 181. Of course, we have lots of zeros here, so a sparse matrix seems the ideal way of storing the data.
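For concreteness, here is a minimal NumPy sketch of that transform; the function name, signature, and padded length are placeholders of mine, following the worked example above:

import numpy as np

def align_pad_normalize(series, padded_length):
    """Scale a series to [0, 1] by its max, then place it in a zero vector
    of length padded_length so that the maximum lands on the centre index."""
    series = np.asarray(series, dtype=float)
    normed = series / series.max()
    out = np.zeros(padded_length)
    centre = padded_length // 2
    offset = centre - normed.argmax()          # shift needed to centre the max
    out[offset:offset + len(normed)] = normed
    return out

print(align_pad_normalize([5, 0, 7, 10, 6], 9))
# -> [0.  0.5 0.  0.7 1.  0.6 0.  0.  0. ]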

Based on this, I have two related questions:

1 - How best to store these in memory? My current, inefficient method is to calculate the dense normed/aligned/padded row for each time series and write it to a plain text file, then separately read that data back into a scipy sparse `lil_matrix`:

import scipy.sparse

rows, columns = N, 181  # N = number of time series
matrix = scipy.sparse.lil_matrix((rows, columns))

with open(file_containing_dense_matrix_data) as f:
    for i, line in enumerate(f):
        # The first two values in each line are metadata
        values = [float(x) for x in line.strip().split(',')[2:]]
        # Assign the parsed dense row to the sparse matrix
        matrix[i] = values

This is both slow and more memory intensive than I had hoped. Is there a preferred method?

2 - Is there a better way to store the time series on disk? I have yet to find an efficient means to write the data to disk directly as a sparse matrix that I can read (relatively) quickly into memory at a later time.

My ideal answer would be a method that addresses both questions, i.e. one that stores the dense rows directly in a sparse data structure and reads/writes the data to/from disk efficiently.

moustachio

1 Answer


I would recommend using the pandas support for sparse matrices, and then its IO tools to write, e.g., to HDF5.
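To make that concrete, here is a rough sketch using the sparse accessor in current pandas (>= 0.25; the older `SparseDataFrame` class this answer refers to no longer exists). The toy data are made up, and the on-disk step below uses scipy's `.npz` format rather than the HDF5 route mentioned above, simply because that round-trip is version-stable:

import numpy as np
import pandas as pd
import scipy.sparse

# Toy stand-in for the real data: three padded/normalised rows of width 181.
dense = np.zeros((3, 181))
dense[0, 87:92] = [0.5, 0.0, 0.7, 1.0, 0.6]
dense[1, 89:92] = [0.2, 1.0, 0.4]
dense[2, 88:92] = [0.3, 0.6, 1.0, 0.8]

# Wrap a scipy CSR matrix in a DataFrame backed by sparse columns.
csr = scipy.sparse.csr_matrix(dense)
df = pd.DataFrame.sparse.from_spmatrix(csr)
print(df.sparse.density)               # fraction of explicitly stored values

# Persist the sparse structure itself; load_npz restores it without ever
# materialising the dense array.
scipy.sparse.save_npz('series.npz', csr)
df_back = pd.DataFrame.sparse.from_spmatrix(scipy.sparse.load_npz('series.npz'))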

logc
  • Ha! I didn't even realize pandas had sparse matrix support. This is excellent! BUT, do you happen to know the proper way of handling iterative additions to the dataframe? I'm adding the rows one at a time, but don't want to build the whole array and then convert to sparse (too much memory). Can I add sparse series as rows (`df = df.append(series.to_sparse())`), or do I have to re "sparsify" the matrix each loop through (i.e. `df = df.append(series).to_sparse()`)? – moustachio Apr 09 '14 at 17:44
  • @moustachio: no, I really don't know about that. I would suggest you use [Numpy sparse arrays and their `vstack` method](http://docs.scipy.org/doc/numpy/reference/generated/numpy.vstack.html#numpy.vstack) to grow the matrix incrementally, and then populate a sparse Pandas data frame as described [in this other SO question](http://stackoverflow.com/questions/17818783/populate-a-pandas-sparsedataframe-from-a-scipy-sparse-matrix) – logc Apr 09 '14 at 18:01 (one version of this incremental pattern is sketched after these comments)
  • For storing sparse matrices to HDF5, see http://stackoverflow.com/a/22589030/2858145. – Pietro Battiston Apr 16 '14 at 21:48
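Following up on the incremental-growth discussion in the comments, here is one possible pattern (my own sketch, not from the thread): accumulate COO triplets row by row and build the scipy matrix once at the end, so no dense N x 181 array is ever held in memory. The generator name and toy data are hypothetical.

import numpy as np
import scipy.sparse

width = 181

def rows_from_disk():
    # Hypothetical generator yielding already aligned/padded/normalised rows.
    for _ in range(3):                  # stand-in for millions of series
        row = np.zeros(width)
        row[87:92] = [0.5, 0.0, 0.7, 1.0, 0.6]
        yield row

row_idx, col_idx, vals = [], [], []
n_rows = 0
for i, row in enumerate(rows_from_disk()):
    nz = np.nonzero(row)[0]             # keep only the non-zero entries
    row_idx.extend([i] * len(nz))
    col_idx.extend(nz.tolist())
    vals.extend(row[nz].tolist())
    n_rows = i + 1

# Build the sparse matrix once from the accumulated triplets.
matrix = scipy.sparse.coo_matrix((vals, (row_idx, col_idx)),
                                 shape=(n_rows, width)).tocsr()

# Alternative: scipy.sparse.vstack over per-row csr_matrix blocks achieves
# the same thing (the sparse counterpart of the vstack idea above).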