
Dear all,
I am a beginner in Python and I am looking for the best way to do the following: assume I have three text files, each with m rows and n columns of numbers, named A, B, and C. Their contents can be indexed as A[i][j], B[k][l], and so on. I need to compute the average of A[0][0], B[0][0], and C[0][0] and write it to file D at D[0][0], and do the same for the remaining entries. For instance, assume that:

A:  
1 2 3   
4 5 6  
B:  
0 1 3  
2 4 5  
C:  
2 5 6  
1 1 1

Therefore, file D should be

D:  
1     2.67   4    
2.33  3.33   4  

My actual files are of course larger than these, on the order of a few MB. I am unsure about the best approach: read all the file contents into a nested structure indexed by filename, or read each file line by line and compute the means as I go. After reading the manual, the fileinput module does not seem useful here, because it reads the lines serially, one file after another, rather than "in parallel" as I need. Any guidance or advice is highly appreciated.

– iluvatar

3 Answers


Have a look at numpy. It can read the three files into three arrays (using fromfile), calculate the average and export it to a text file (using tofile).

import numpy as np

a = np.fromfile('A.csv', dtype=int)
b = np.fromfile('B.csv', dtype=int)
c = np.fromfile('C.csv', dtype=int)

d = (a + b + c) / 3.0

d.tofile('D.csv')

Files of a few MB in size should not be a problem.

– eumiro
  • thanks for your help! This is the power of Python I want to exploit! I have just tried the np.fromfile function, but it does not read the numbers correctly. It seems that a better alternative is np.loadtxt (my files are just txt files). Thanks again. – iluvatar Nov 11 '10 at 22:10
  • @user505047 - you're right, loadtxt is the right choice. I am glad you enjoy discovering numpy. It can save a lot of headache and time when manipulating large numeric arrays just like here. – eumiro Nov 12 '10 at 05:33
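
Following the comments, here is a minimal sketch of the loadtxt variant, assuming the inputs are whitespace-separated text files named A.txt, B.txt, and C.txt (the filenames are placeholders):

import numpy as np

# loadtxt parses whitespace-separated text into a 2-D array, which is
# what fromfile (a binary reader by default) could not do here.
a = np.loadtxt('A.txt')
b = np.loadtxt('B.txt')
c = np.loadtxt('C.txt')

d = (a + b + c) / 3.0

# savetxt preserves the row/column layout; tofile would write a flat dump.
np.savetxt('D.txt', d, fmt='%.2f')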

For plain text files, try this:

def readdat(data, sep=','):
    """Parse a block of text into a list of rows of floats."""
    rows = []
    for line in data.strip().split('\n'):
        rows.append([float(field) for field in line.split(sep)])
    return rows

def formatdat(data, sep=','):
    """Format a list of rows of numbers back into one text block."""
    lines = []
    for row in data:
        lines.append(sep.join(str(value) for value in row))
    return '\n'.join(lines)

and then use these functions to parse each file's text into a list of rows, average the values, and format the result back into text, as in the sketch below.
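
A short usage sketch, assuming comma-separated inputs named A.txt, B.txt, and C.txt (placeholder names):

# Parse each file into a list of rows of floats.
matrices = []
for name in ('A.txt', 'B.txt', 'C.txt'):
    with open(name) as f:
        matrices.append(readdat(f.read()))

# Average corresponding entries across the three matrices.
averaged = [[sum(vals) / len(vals) for vals in zip(*rows)]
            for rows in zip(*matrices)]

with open('D.txt', 'w') as f:
    f.write(formatdat(averaged))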

– Eric Pauley

Just for reference, here's how you'd do the same sort of thing without numpy (less elegant, but more flexible):

files = zip(open("A.dat"), open("B.dat"), open("C.dat"))
outfile = open("D.dat","w")
for rowgrp in files:     # e.g.("1 2 3\n", "0 1 3\n", "2 5 6\n")
    intsbyfile = [[int(a) for a in row.strip().split()] for row in rowgrp]
                         # [[1,2,3], [0,1,3], [2,5,6]]
    intgrps = zip(*intsbyfile) # [(1,0,2), (2,1,5), (3,3,6)]
    # use float() to ensure we get true division in Python 2.
    averages = [float(sum(intgrp))/len(intgrp) for intgrp in intgrps]
    outfile.write(" ".join(str(a) for a in averages) + "\n")

In Python 3, zip will only read the files as they are needed. In Python 2, if they're too big to load into memory, use itertools.izip instead.
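
For example, on Python 2 the only change needed is the pairing function; a sketch with the same placeholder filenames:

# Python 2: izip pairs the lines lazily, so the files are streamed
# instead of being read fully into memory. In Python 3 the built-in
# zip already behaves this way.
from itertools import izip

files = izip(open("A.dat"), open("B.dat"), open("C.dat"))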

– Thomas K