I am a newbie in Python. I have a huge CSV file of 18 GB with 48 million records, each of which is a 37-dimensional vector recorded at ~1700 Hz. I am trying to apply a sliding window over it using this approach, and for each window I calculate the simple mean and variance of the data. For smaller amounts of data this worked fine, but once I tried it on my actual file it takes ages. I am using the following code:
This code subclasses list to add functionality like deque.maxlen:
import csv
import numpy

max_list_size = 3015000   # number of samples in 30 minutes
sliding_factor = 1005000  # number of samples in 10 minutes

class L(list):
    def append(self, item):
        global max_list_size
        list.append(self, item)
        if len(self) > max_list_size:
            self[:1] = []  # drop the oldest record
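For illustration, with a toy max_list_size of 3 the class drops the oldest element once the limit is exceeded (hypothetical values):

    >>> max_list_size = 3
    >>> mylist = L([1, 2, 3])
    >>> mylist.append(4)
    >>> mylist
    [2, 3, 4]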
This function calculates the mean and variance over my list:
def calc_feature(mylist):
    print 'mean is ', numpy.mean(mylist)
    print 'variance is ', numpy.var(mylist)
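On a tiny toy window this behaves as follows; numpy flattens the list of records and computes the statistics over all values of all records:

    >>> calc_feature([[1.0, 2.0], [3.0, 4.0]])
    mean is  2.5
    variance is  1.25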
This reads the file and calculates the features for every window:
def read_mycsv(csv_filepath):
    global max_list_size, sliding_factor
    mylist = L()
    with open(csv_filepath, "rb") as f:
        reader = csv.reader(f)
        for _ in range(max_list_size):
            mylist.append(map(float, reader.next()))  # filling records in list
        try:
            while 1:
                calc_feature(mylist)
                for _ in range(sliding_factor):
                    mylist.append(map(float, reader.next()))  # slide the window
        except StopIteration:
            calc_feature(mylist)  # final (partial) window
Calculating the mean and variance of the first window took 5 minutes, but the code never responded for the second window. I don't understand what I am doing wrong. I tried to look over the internet as well, but I think I am searching in the wrong direction.
EDIT

As suggested by @Omada, I changed my data structure from a list to a deque, and now it works for the following windows as well; a minimal sketch of the change is shown below. I still think reading each line in a loop and putting it into the deque is expensive. Is there any way to read a chunk of the file at once?
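The deque-based sketch, assuming the same constants as above (deque's built-in maxlen handles the trimming, so the L subclass is no longer needed):

    from collections import deque

    mylist = deque(maxlen=max_list_size)  # old records are dropped automatically in O(1)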