
I am a newbie in Python. I have a huge CSV file of 18 GB with 48 million records. Each record is a 37-dimensional vector, recorded at ~1700 Hz. What I am trying to do is apply a sliding window over it using this approach, and for each window calculate the simple mean and variance of the data. For a smaller amount of data it was fine, but once I tried to run it over my actual file it takes ages. I am using the following code:

This code subclasses list to add functionality like deque.maxlen:

import csv
import numpy

max_list_size = 3015000   # number of samples in 30 min at ~1700 Hz
sliding_factor = 1005000  # number of samples in 10 min

class L(list):
    def append(self, item):
        global max_list_size
        list.append(self, item)
        if len(self) > max_list_size: self[:1]=[]

This function calculates the mean and variance over my list:

def calc_feature(mylist):
    print 'mean is ', numpy.mean(mylist)
    print 'variance is ', numpy.var(mylist)

This reads the file and calculates the features for every window:

def read_mycsv(csv_filepath):
    global max_list_size, sliding_factor
    mylist = L()
    with open(csv_filepath, "rb") as f:
        reader = csv.reader(f)
        for _ in range(max_list_size):
            mylist.append(map(float, reader.next()))  # fill the first window
        try:
            while 1:
                calc_feature(mylist)
                for _ in range(sliding_factor):  # slide the window forward
                    mylist.append(map(float, reader.next()))
        except StopIteration:
            calc_feature(mylist)  # whatever is left at end of file

For the first window it took 5 minutes to return the mean and variance, but it never returned anything for the 2nd window. I don't understand what I am doing wrong. I tried looking on the internet as well, but I think I am searching in the wrong direction.

EDIT

As suggested by @Omada, I changed my data structure from a list to a deque and now it works for the following windows as well. I still think reading each line in a loop and appending it to the deque is expensive. Is there any way to read a chunk of the file at once?
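Roughly, the change just replaces the L() list in read_mycsv with a deque (a minimal sketch, assuming collections.deque with maxlen as in the answer below):

from collections import deque

# a deque with maxlen drops old items from the left automatically,
# so the L class above is no longer needed
mylist = deque(maxlen=max_list_size)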

Muaz
  • You'll get the best performance for this kind of task by reading the file in bigger blocks, converting those to numpy arrays, and then calculating the summaries on subsets of those arrays (this keeps the data in a single memory location). – liborm Nov 13 '15 at 21:13
  • @liborm Yes, I think reading a single line is taking too much time. I looked up reading the file in blocks but failed. Do you have any resources related to that? – Muaz Nov 14 '15 at 16:29
  • It's not the reading, but rather processing each line separately, that makes it slow. You can read, say, 5 MB of the data at a time, then use something like [pandas](http://pandas.pydata.org/pandas-docs/stable/computation.html#moving-rolling-statistics-moments) to calculate the statistics, then read the next block accounting for the overlap, rinse and repeat. – liborm Nov 14 '15 at 19:12
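A minimal sketch of the chunked approach liborm describes, using pandas.read_csv with chunksize (the header=None setting, the function name, and the per-chunk bookkeeping are assumptions, not from the post). Because the 30-minute window is exactly three 10-minute steps, keeping only per-chunk counts, sums, and sums of squares is enough to get each window's mean and variance without holding the raw rows:

import collections
import pandas

window_chunks = 3            # 3 x 10-minute chunks == one 30-minute window
step_size = 1005000          # rows in 10 min at ~1700 Hz

def windowed_stats(csv_filepath):
    # keep only aggregates per chunk: (count, sum, sum of squares)
    stats = collections.deque(maxlen=window_chunks)
    reader = pandas.read_csv(csv_filepath, header=None, chunksize=step_size)
    for chunk in reader:
        values = chunk.values                         # 2-D numpy array for this chunk
        stats.append((values.size, values.sum(), (values ** 2).sum()))
        if len(stats) == window_chunks:
            n = sum(s[0] for s in stats)
            total = sum(s[1] for s in stats)
            total_sq = sum(s[2] for s in stats)
            mean = total / n
            variance = total_sq / n - mean ** 2       # E[x^2] - E[x]^2
            print 'mean is ', mean
            print 'variance is ', variance

(Note that the E[x^2] - E[x]^2 form can lose precision when the mean is large relative to the spread; numpy's two-pass variance is safer if that matters.)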

1 Answer


Your issue is with your class L:

    if len(self) > max_list_size: self[:1]=[]

This does remove the first element from the list, but in Python removing from the front of a list is an O(n) operation: the list has to shift the remaining max_list_size elements every time you do this.

The easiest way to fix this is to just use a deque instead of L. As you said, it has a maxlen property that does what you want. numpy.mean and numpy.var work fine with a deque, so you don't even need to change any other code.
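For illustration, a small sketch of both points (the numbers here are just an example):

from collections import deque
import numpy

window = deque(maxlen=5)      # items beyond maxlen are discarded from the left in O(1)
for i in range(8):
    window.append(float(i))

print list(window)                       # [3.0, 4.0, 5.0, 6.0, 7.0]
print 'mean is ', numpy.mean(window)     # 5.0
print 'variance is ', numpy.var(window)  # 2.0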

Omada