Calculating an average for every X number of lines

Question

I am trying to take data from a text file and calculate an average for every 600 lines of that file. I'm loading the text from the file, putting it into a NumPy array and enumerating it. I can get the average for the first 600 lines, but I'm not sure how to write a loop so that Python calculates an average for every 600 lines and then writes the results to a new text file. Here is my code so far:

import numpy as np

#loads file and places it in array
data = np.loadtxt('244UTZ10htz.txt', delimiter = '\t', skiprows = 2)
shape = np.shape(data)

#creates array for u wind values
for i,d in enumerate(data):
    data[i] = (d[3])
    if i == 600:
        minavg = np.mean(data[i == 600])

#finds total u mean for day
ubar = np.mean(data)

I don't think data[i == 600] does what you think it does. It's basically the same as data[1] in this case. — M4rtini, Mar 17 '14 at 19:37
What is the shape of `data`, and do you want the average of every column, or just the fourth? — unutbu, Mar 17 '14 at 19:37
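
To illustrate the point made in the first comment, here is a minimal sketch using made-up data (the array `values` and the other names are illustrative and not taken from the original file). The comparison `i == 600` evaluates to a single boolean, so NumPy never receives a range of rows, whereas a slice selects the intended block. What indexing with a bare boolean returns depends on the NumPy version, so the sketch only prints its shape.

```python
import numpy as np

# Made-up stand-in for one column of the file: 1800 values
values = np.arange(1800, dtype=float)

i = 600
index = (i == 600)           # a plain Python bool, not a range of rows
print(type(index))

# Indexing with that bool does not pick out a block of 600 rows
picked = values[index]
print(picked.shape)

# A slice selects rows 0..599, which is what the question is after
first_block = values[0:600]
first_block_mean = np.mean(first_block)
print(first_block.shape, first_block_mean)
```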

mdadm · Answer 1 · 2014-03-18T16:43:45.817

Based on what I understand from your question, you have a file and want the mean of each consecutive block of 600 lines, repeated until there is no more data. So at line 600 you average lines 0 to 599, at line 1200 you average lines 600 to 1199, and so on.

Modulus division is one way to take the average at every 600th line without using a separate variable to keep count of how many lines you've looped through. Additionally, I used NumPy array slicing to create a view of the original data containing only the 4th column of the data set.

This example should do what you want, but it is entirely untested. I'm also not terribly familiar with NumPy, so there are better ways to do this, as mentioned in the other answers:

import numpy as np

#loads file and places it in array
data = np.loadtxt('244UTZ10htz.txt', delimiter = '\t', skiprows = 2)
data_you_want = data[:,3]   #view of the 4th column (u wind values)
daily_averages = list()

#averages each completed block of 600 lines
for i,d in enumerate(data_you_want):
    #i is zero-based, so a block is complete when (i + 1) is a multiple of 600
    if ((i + 1) % 600) == 0:
        avg_for_day = np.mean(data_you_want[i + 1 - 600:i + 1])
        daily_averages.append(avg_for_day)

You can either modify the example above to write the mean out to a new file, instead of appending to a list as I have done, or just write the daily_averages list out to whatever file you want.
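
As a rough sketch of the second option, the list could be written out with `np.savetxt`. The values, the output file name, and the temporary directory below are assumptions made for illustration only; substitute the real `daily_averages` list and the path you want.

```python
import os
import tempfile

import numpy as np

# Hypothetical averages standing in for the daily_averages list
daily_averages = [1.5, 2.25, 3.0]

# Write to a temporary directory so the sketch does not touch real files
out_path = os.path.join(tempfile.mkdtemp(), 'daily_averages.txt')

# One average per line; fmt controls how many decimals are written
np.savetxt(out_path, np.asarray(daily_averages), fmt='%.4f')

# Read the file back to check what was written
written = np.loadtxt(out_path)
print(written)
```

Using a format such as `'%.4f'` is only one choice; it trades a little precision for a tidier file.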

As a bonus, here is a pure Python solution using only the csv module. It hasn't been tested much, but it should work and might be fairly easy to understand for someone new to Python.

import csv

data = list()
daily_average = list()
num_lines = 600

with open('testme.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter="\t")

    for i,row in enumerate(reader):
        # float() accepts both integer and decimal values
        data.append(float(row[3]))

        # after every num_lines rows, average the most recent block
        if ((i + 1) % num_lines) == 0:
            average = sum(data[i + 1 - num_lines:i + 1]) / num_lines
            daily_average.append(average)

Hope this helps!

With slicing there is no modulus division required. Try: `np.arange(5)[3:5000]`. You really should update the answer- iterating through the entire array, even if nothing is done at every iteration, is not a good way to do this. — Daniel, Mar 17 '14 at 20:28
@Ophion I wrote an answer already which uses array slicing. Any feedback +ve or -ve would be welcome. — TooTone, Mar 17 '14 at 20:40
Thanks guys. I must admit, I don't know much about Numpy so you taught me something new. I updated the answer. I kept the modulus solution since it seems so few new programmers know about it. — mdadm, Mar 18 '14 at 16:44
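
The slicing behaviour mentioned in the first comment can be checked with a small sketch. The names `small`, `tail`, and `blocks` are made up for illustration; the point is that a slice whose stop index runs past the end of the array is simply cut off at the end, so stepping through start positions with `range()` needs neither a modulus test nor a bounds check.

```python
import numpy as np

small = np.arange(5)

# The stop index 5000 is far past the end, but the slice just ends at the last element
tail = small[3:5000]
print(tail)

# Stepping through start indices with range() visits each block exactly once
block_size = 2
blocks = [small[start:start + block_size]
          for start in range(0, len(small), block_size)]
print([b.tolist() for b in blocks])
```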

Trond Kristiansen · Answer 2 · 2014-03-17T20:12:48.960

A simple solution would be:

import numpy as np
data = np.loadtxt('244UTZ10htz.txt', delimiter = '\t', skiprows = 2)
mydata = []      # values collected for the current block
averages = []    # one mean per completed block of 600 lines
counter = 0
for d in data:
    mydata.append(d[3])
    counter += 1

    # Find the average of the previous 600 lines
    if counter == 600:
        averages.append(np.mean(mydata))

        # reset the counter and the buffer, then start counting from 0
        counter = 0
        mydata = []

edited Mar 17 '14 at 20:12

answered Mar 17 '14 at 19:40


I assume it should be `mydata = []` at the second to last line there – M4rtini Mar 17 '14 at 19:52

score 0 · Answer 3 · edited May 23 '17 at 12:07

The following program uses array slicing to get the column, and then a list comprehension indexing into the column to get the means. It might be simpler to use a for loop for the latter.

Slicing / indexing into the array rather than creating new objects also has the advantage of speed as you're just creating new views into existing data.

import numpy as np

# test data
nr = 11
nc = 3
a = np.array([np.array(range(nc))+i*10 for i in range(nr)])
print(a)

# slice to get column
col = a[:,1]
print(col)

# comprehension to step through column to get means
# (a slice that runs past the end is cut off there, so no bounds check is needed;
# float() keeps the printed list plain on newer NumPy versions)
numpermean = 2
means = [float(np.mean(col[i:i+numpermean]))
         for i in range(0,len(col),numpermean)]

print(means)

It prints:

[[  0   1   2]
 [ 10  11  12]
 [ 20  21  22]
 [ 30  31  32]
 [ 40  41  42]
 [ 50  51  52]
 [ 60  61  62]
 [ 70  71  72]
 [ 80  81  82]
 [ 90  91  92]
 [100 101 102]]
[  1  11  21  31  41  51  61  71  81  91 101]
[6.0, 26.0, 46.0, 66.0, 86.0, 101.0]
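
The for-loop alternative mentioned at the top of this answer could be sketched as follows. The function name `block_means` is made up for illustration, and the column is typed in by hand to match the test data above rather than being sliced from `a`.

```python
import numpy as np

def block_means(col, numpermean):
    """Return the mean of each consecutive block of numpermean values."""
    means = []
    for start in range(0, len(col), numpermean):
        block = col[start:start + numpermean]   # basic slice: a view, not a copy
        means.append(float(np.mean(block)))
    return means

# Same values as the test column in the program above
col = np.array([1, 11, 21, 31, 41, 51, 61, 71, 81, 91, 101])
result = block_means(col, 2)
print(result)
```

As with the comprehension, the final block may be shorter than `numpermean`; here it contains the single value 101.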

score 0 · Answer 4 · answered Mar 17 '14 at 20:30

Something like this works. It may not be that readable, but it should be fairly fast.

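import numpy as np
data = np.loadtxt('244UTZ10htz.txt', delimiter = '\t', skiprows = 2)
n = data.shape[0] // 600   # number of complete 600-line blocks
interestingData = data[:,3]
# drop the incomplete tail, reshape to (n, 600), then average each row
daily_averages = np.mean(interestingData[:600*n].reshape(-1, 600), axis=1)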
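
To make the reshape step easier to follow, here is a small sketch on made-up data: 14 values averaged in blocks of 4 (the names `column`, `block_size`, and `tail_average` are illustrative). It also shows that the values left over after the last complete block are dropped by the reshape, and one possible way to average them separately.

```python
import numpy as np

# Made-up column of 14 values, averaged in blocks of 4
column = np.arange(14, dtype=float)
block_size = 4

n = len(column) // block_size                    # 3 complete blocks
complete = column[:n * block_size].reshape(-1, block_size)
block_averages = complete.mean(axis=1)           # one mean per row
print(block_averages)

# The reshape leaves out the last 2 values; average them separately if needed
tail = column[n * block_size:]
tail_average = tail.mean() if tail.size else None
print(tail_average)
```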

M4rtini

