Calculating the mean across multiple files

Question

I'm very new to Python and I have searched a lot for a question similar to mine. I would like to do something similar to what is explained in the question "Computing averages of records from multiple files with python".

However, instead of taking the mean of every value (in that example all values are numeric), I would like to take the mean of a single column and keep the values of the other columns unchanged.

For example:

fileA.txt:  
0.003 0.0003 3 Active   
0.003 0.0004 1 Active  

fileB.txt:  
0.003 0.0003 1 Active   
0.003 0.0004 5 Active

and I would like to generate the following output file:

output.txt:
0.003 0.0003 2 Active   
0.003 0.0004 3 Active

Although columns 1 and 2 are numeric too, they will have the same value at the same position across all 100 files. So I'm only interested in the mean value of each element of column 3 across the 100 files.

Also, although the code in the question "Computing averages of records from multiple files with python" works for reading my files, it is not practical when there are lots of files. How can I optimize that?

I managed to read my files using the following code:

import numpy as np

result = []
for i in my_files:
    a = np.array(np.loadtxt(i, dtype = str, delimiter = '\t', skiprows = 1))
    result.append(a)
result = np.array(result)

I used code similar to what is suggested in the question "initialize a numpy array".

Each of my files will have about 1500 rows and 4 columns. I tried to use np.mean, but it does not work, probably because some of my data are strings.

Thanks in advance for your help!

To ensure all the elements are numerical, try using map, e.g. allDouble = map( lambda el : float(el), mixedTypeArray ). Also, don't set dtype to be str if you want them to be numeric. – Matthew Turner Jul 19 '13 at 18:07
Thanks! If I don't set dtype to str, I get the following error message: **ValueError: could not convert string to float: Transposon Inactive**. How do I use this map function... I didn't understand what you mean (sorry). – Fabs Jul 19 '13 at 18:18
Do you need the fourth column? If not, see my answer below. map is a useful function that applies a function to every member of an array. The `lambda` defines a function inline, inside the call to map. It's equivalent to defining `def toFloat(num): return float(num)` and then calling `map(toFloat, arrayToBeConverted)`. It takes some time to get used to, but is very useful once you get it. – Matthew Turner Jul 19 '13 at 18:50
Yes, I do need the fourth column. I will plot a graph in the end. But thanks for your answer. – Fabs Jul 19 '13 at 19:07
Sure. You could call np.loadtxt for that column separately, i.e. `activeCol = np.loadtxt(i, dtype=str, usecols=(3,), ... )` (columns are zero-indexed, so the fourth column is index 3). – Matthew Turner Jul 19 '13 at 21:24
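
The map suggestion from the comments above can be sketched as follows. This is only an illustration, assuming Python 3 (where map returns an iterator, hence the list() calls); the names `mixed` and `to_float` are made up for the example.

```python
# Hypothetical sample data: numeric strings, as they come out of a text file.
mixed = ['0.003', '0.0003', '3']

def to_float(el):
    # A named function needs an explicit return; a lambda returns implicitly.
    return float(el)

# All three forms build the same list of floats.
with_lambda = list(map(lambda el: float(el), mixed))
with_def = list(map(to_float, mixed))
direct = list(map(float, mixed))

print(with_lambda)
```

Passing `float` directly is the shortest form, since it is already a function of one argument.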

unutbu · Accepted Answer · 2013-07-20T20:02:22.910

If you load the arrays with np.genfromtxt(..., dtype=None), then genfromtxt will guess the dtype for each column. For example, the third column will be given an integer dtype. This will make your array suitable for arithmetic. Using dtype='str' results in an array of strings, which is not suitable for arithmetic.

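import csv
import numpy as np

my_files = ['fileA.txt', 'fileB.txt']

# Sum the third column (index 2) element-wise across all files.
vals = None
for num, filename in enumerate(my_files, 1):
    # skip_header is the genfromtxt keyword for skipping leading rows
    arr = np.genfromtxt(filename, dtype=None, delimiter='\t', skip_header=1, usecols=(2,))
    if vals is None:
        vals = arr.astype(float)  # float accumulator, so the mean is not truncated
    else:
        vals += arr

meanvals = vals / num

# Copy the other columns as strings from the first file, so their
# original formatting is preserved (Python 3: text mode, newline='').
with open(my_files[0], newline='') as fin, open('/tmp/test.csv', 'w', newline='') as fout:
    # skip first row
    next(fin)
    writer = csv.writer(fout, delimiter='\t', lineterminator='\n')
    for row, val in zip(csv.reader(fin, delimiter='\t'), meanvals):
        # '{:g}' drops a trailing .0 (2.0 -> 2) and keeps about 6 significant digits
        row[2] = '{:g}'.format(val)
        writer.writerow(row)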

The result, in /tmp/test.csv looks like this:

0.003   0.0003  2   Active
0.003   0.0004  3   Active
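
To see what dtype=None does when all columns are read, here is a small sketch. It assumes a tab-separated file with one header row; the in-memory `text` and its header names are invented stand-ins for a real file, and `encoding='utf-8'` is passed explicitly so the string column is read as text.

```python
import io
import numpy as np

# Hypothetical tab-separated content with a header row, standing in for fileA.txt.
text = "c1\tc2\tc3\tstatus\n0.003\t0.0003\t3\tActive\n0.003\t0.0004\t1\tActive\n"

# Without usecols, dtype=None gives a 1-D structured array with one field per column.
arr = np.genfromtxt(io.StringIO(text), dtype=None, delimiter='\t',
                    skip_header=1, encoding='utf-8')

print(arr.dtype.names)   # default field names f0, f1, ...
print(arr['f2'].dtype)   # an integer dtype is guessed for the third column
print(arr['f2'] + 1)     # arithmetic works on the numeric field
```

Because the array is structured and 1-dimensional, a column is selected by field name (`arr['f2']`) rather than by a second index.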

edited Jul 20 '13 at 20:02

answered Jul 19 '13 at 18:17

Thanks! I tried your code and I get the following error message **IndexError: too many indices**. I don't have any idea what it might be. Any suggestions? – Fabs Jul 19 '13 at 18:43
Yes, there was a mistake. `np.genfromtxt` returns a structured array, which is 1-dimensional, not 2-dimensional. So you would get the 3rd column with `arr['f2']`, rather than `arr[:, 2]`. (The column, or field names, are labels `f0`, `f1`, etc. by default.). – unutbu Jul 19 '13 at 18:52
Thanks! Now it did work. I also tried this version using np.mean and it also works. I didn't know about this 'f1' etc :) So if I want to write the information of all the other columns should I just loop once in one of the files and copy infos of columns 1, 2 and 4 in another txt file and add this mean variable "meanvals" as the third column? No idea how to do that, but I will try to figure out! :) – Fabs Jul 19 '13 at 19:04
Sorry, when I read your answer the full code did not appear! Thanks again!! That was very useful! I'll try it with my full data instead of a two-row dataset! :) – Fabs Jul 19 '13 at 19:06
I have one more question... How can I save the .csv file while keeping each number as it is? For example, the first number is 0.003, but when I run this code it saves a csv file with the number 0.0029999999999999997. Is there a way of converting the whole array to strings, for example, before saving? Thanks! – Fabs Jul 19 '13 at 19:44
Okay, in that case, I think the easiest way is to open `fileA.txt` with `csv.reader` and use the *strings* it reads in the output rather than the numerical values in `arr`. That will guarantee that there is no alteration of the original format due to the inexactness of floating-point representations. I've edited the post above to show what I mean. – unutbu Jul 19 '13 at 20:02
Thanks! It did work. Do I need that comma after lineterminator='\n' in the csv.writer? I did remove the comma, and it doesn't appear important. (Sorry for asking this, I'm new to Python and I just would like to understand things) :) Thanks again for your help! – Fabs Jul 20 '13 at 19:18
The comma need not be there. Python is just forgiving about its presence. :) – unutbu Jul 20 '13 at 20:02
Hi, would you recommend some online tutorials on working with these types of arrays? I was wondering how to make some variations to this code, for example taking the mean values for two (or three, etc.) columns instead of one, and maybe not saving all the columns to the file... I did try to take the mean of two columns, and everything I tried did not work :) Thanks!! – Fabs Jul 22 '13 at 17:59
There is no easy way to learn NumPy. It just takes time. Things to do include going through the [Tentative NumPy Tutorial](http://wiki.scipy.org/Tentative_NumPy_Tutorial) and the [Example List](http://wiki.scipy.org/Numpy_Example_List). I've found keeping a file of simple examples which demonstrate each NumPy function to be very helpful. Reading documentation is too passive. The only way to learn programming is to write and play with code. You can also search Stack Overflow, or google for examples, and if that does not work, post a new question to Stack Overflow. – unutbu Jul 22 '13 at 18:23
Hi, I was just wondering... if I want to keep the value generated in the line "meanvals = vals / num" as a float, how should I convert it? I tried float() but it does not work. For example, take my example above: if the average is a decimal number, this code rounds it. Instead of saving the decimal number to the file, it saves an int (this is because my input is an int rather than a float). Thanks! – Fabs Sep 26 '13 at 15:42
I think I figured it out. I used: vals = arr.astype(float) and vals += arr.astype(float) and it seems to work :) Thanks! – Fabs Sep 26 '13 at 15:53
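
For the two follow-up questions in the comments (averaging more than one column, and keeping the result as a float), one possible sketch is below. It is not the answer's code: it swaps in np.loadtxt with dtype=float, and the helper name `mean_columns` and the in-memory test files are hypothetical. It assumes tab-separated input with one header row and only numeric columns in `cols`.

```python
import io
import numpy as np

def mean_columns(sources, cols):
    """Element-wise mean of the chosen columns across all sources.

    sources: file names or open file objects (tab-separated, one header row).
    cols: tuple of zero-based column indices, all of them numeric.
    """
    total = None
    count = 0
    for src in sources:
        # dtype=float keeps the running sum floating-point, so integer
        # inputs do not end up as truncated means.
        arr = np.loadtxt(src, dtype=float, delimiter='\t', skiprows=1,
                         usecols=cols, ndmin=2)
        total = arr if total is None else total + arr
        count += 1
    return total / count

# Hypothetical in-memory files standing in for fileA.txt and fileB.txt.
file_a = io.StringIO("a\tb\tc\td\n0.003\t0.0003\t3\tActive\n0.003\t0.0004\t1\tActive\n")
file_b = io.StringIO("a\tb\tc\td\n0.003\t0.0003\t2\tActive\n0.003\t0.0004\t5\tActive\n")

means = mean_columns([file_a, file_b], cols=(1, 2))
print(means)
```

Only one file's worth of data plus the running sum is held in memory at a time, which also helps with a large number of files.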

Matthew Turner · Answer 2 · 2013-07-19T18:55:55.117

There's another keyword arg in np.loadtxt: usecols. Try using that, e.g.

a = np.loadtxt(i, usecols = (0,1,2), delimiter = '\t', skiprows = 1)

You don't need the np.array call, since np.loadtxt already returns an ndarray. I omitted dtype=str because the default is dtype=float, which should do fine if you want to calculate the mean.

Also, instead of creating an array of arrays, if you just want to calculate the mean in each file, I'd suggest you do that within the for loop and just save the result of that calculation.
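
A minimal sketch of that suggestion, assuming tab-separated files with one header row: compute each file's mean inside the loop and keep only that number. The function name `per_file_means` and the in-memory files are made up for illustration.

```python
import io
import numpy as np

def per_file_means(sources):
    """Return the mean of the third column for each source, in order."""
    means = []
    for src in sources:
        # usecols=(2,) loads only the numeric third column, so the
        # string column never reaches the float conversion.
        col = np.loadtxt(src, usecols=(2,), delimiter='\t', skiprows=1)
        means.append(col.mean())
    return means

# Hypothetical in-memory files standing in for fileA.txt and fileB.txt.
file_a = io.StringIO("a\tb\tc\td\n0.003\t0.0003\t3\tActive\n0.003\t0.0004\t1\tActive\n")
file_b = io.StringIO("a\tb\tc\td\n0.003\t0.0003\t1\tActive\n0.003\t0.0004\t5\tActive\n")

result = per_file_means([file_a, file_b])
print(result)
```

Note that this gives one mean per file, which is different from the element-wise mean across files asked for in the question.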
