
Dear all,
I am a beginner in Python and I am looking for the best way to do the following: assume I have three text files, each with m rows and n columns of numbers, named A, B, and C. Their contents can be indexed as A[i][j], B[k][l], and so on. I need to compute the average of A[0][0], B[0][0], and C[0][0] and write it to file D at D[0][0], and do the same for the remaining entries. For instance, assume that:

A:  
1 2 3   
4 5 6  
B:  
0 1 3  
2 4 5  
C:  
2 5 6  
1 1 1

Therefore, file D should be

D:  
1     2.67   4    
2.33  3.33   4  

My actual files are of course larger than these, on the order of a few MB. I am unsure about the best approach: read all the file contents into a nested structure indexed by filename, or read each file line by line and compute the means as I go. After reading the manual, the fileinput module does not seem useful here, because it reads the lines serially, one file after another, rather than "in parallel" as I need. Any guidance or advice is highly appreciated.

– iluvatar

3 Answers


Have a look at numpy. It can read the three files into three arrays (using fromfile), calculate the average and export it to a text file (using tofile).

import numpy as np

a = np.fromfile('A.csv', dtype=int)
b = np.fromfile('B.csv', dtype=int)
c = np.fromfile('C.csv', dtype=int)

d = (a + b + c) / 3.0

d.tofile('D.csv')

Files of a few MB in size should not be a problem.

– eumiro
  • thanks for your help! This is the power of Python I want to exploit! I have just tried the np.fromfile function, but it does not read the numbers correctly. It seems that a better alternative is np.loadtxt (my files are just txt files). Thanks again. – iluvatar Nov 11 '10 at 22:10
  • @user505047 - you're right, loadtxt is the right choice. I am glad you enjoy discovering numpy. It can save a lot of headache and time when manipulating large numeric arrays just like here. – eumiro Nov 12 '10 at 05:33
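
Following the comments, here is a minimal sketch of the loadtxt variant, assuming the inputs are whitespace-separated text files named A.txt, B.txt, and C.txt (the filenames are placeholders):

import numpy as np

# loadtxt parses whitespace-separated text into a 2-D array, which is
# what fromfile (a binary reader by default) could not do here.
a = np.loadtxt('A.txt')
b = np.loadtxt('B.txt')
c = np.loadtxt('C.txt')

d = (a + b + c) / 3.0

# savetxt preserves the row/column layout; tofile would write a flat dump.
np.savetxt('D.txt', d, fmt='%.2f')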

For plain text files, try this:

def readdat(data, sep=','):
    """Parse a block of text into a list of rows of floats."""
    rows = []
    for line in data.strip().split('\n'):
        rows.append([float(field) for field in line.split(sep)])
    return rows

def formatdat(data, sep=','):
    """Format a list of rows of numbers back into one text block."""
    lines = []
    for row in data:
        lines.append(sep.join(str(value) for value in row))
    return '\n'.join(lines)

and then use these functions to parse each file's text into a list of rows, average the values, and format the result back into text, as in the sketch below.
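
A short usage sketch, assuming comma-separated inputs named A.txt, B.txt, and C.txt (placeholder names):

# Parse each file into a list of rows of floats.
matrices = []
for name in ('A.txt', 'B.txt', 'C.txt'):
    with open(name) as f:
        matrices.append(readdat(f.read()))

# Average corresponding entries across the three matrices.
averaged = [[sum(vals) / len(vals) for vals in zip(*rows)]
            for rows in zip(*matrices)]

with open('D.txt', 'w') as f:
    f.write(formatdat(averaged))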

– Eric Pauley

Just for reference, here's how you'd do the same sort of thing without numpy (less elegant, but more flexible):

files = zip(open("A.dat"), open("B.dat"), open("C.dat"))
outfile = open("D.dat","w")
for rowgrp in files:     # e.g.("1 2 3\n", "0 1 3\n", "2 5 6\n")
    intsbyfile = [[int(a) for a in row.strip().split()] for row in rowgrp]
                         # [[1,2,3], [0,1,3], [2,5,6]]
    intgrps = zip(*intsbyfile) # [(1,0,2), (2,1,5), (3,3,6)]
    # use float() to ensure we get true division in Python 2.
    averages = [float(sum(intgrp))/len(intgrp) for intgrp in intgrps]
    outfile.write(" ".join(str(a) for a in averages) + "\n")

In Python 3, zip will only read the files as they are needed. In Python 2, if they're too big to load into memory, use itertools.izip instead.
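
For example, on Python 2 the only change needed is the pairing function; a sketch with the same placeholder filenames:

# Python 2: izip pairs the lines lazily, so the files are streamed
# instead of being read fully into memory. In Python 3 the built-in
# zip already behaves this way.
from itertools import izip

files = izip(open("A.dat"), open("B.dat"), open("C.dat"))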

– Thomas K