
I have a set of data files (say, "data####.dat", where #### = 0001,...,9999) which all share the same structure: the same x-values in the first column, followed by a number of columns with different y-values.

data0001.dat:

#A < comment line with unique identifier 'A'
#B1 < this is a comment line that can/should be dropped
1 11 21
2 12 22
3 13 23

data0002.dat:

#A < comment line with unique identifier 'A'
#B2 < this is a comment line that can/should be dropped
1 13 23
2 12 22
3 11 21

They originate from different runs of my program with different seeds, and I now want to combine these partial results into one common histogram: comment lines starting with "#A" (which are identical for all files) are retained, other comment lines are dropped, the first column stays as-is, and all other columns should be averaged over all data files:

dataComb.dat:

#A < comment line with unique identifier 'A'
1 12 22 
2 12 22 
3 12 22 

where 12 = (11+13)/2 = (12+12)/2 = (13+11)/2 and 22 = (21+23)/2 = (22+22)/2 = (23+21)/2

I already have a bash script (probably horrible code; I'm not very experienced...) that does this job when called as ./merge.sh data* > dataComb.dat on the command line. It also checks that all data files have the same number of columns and the same values in the first column.

merge.sh:

#!/bin/bash

if [ $# -lt 2 ]; then
    echo "at least two files please"
    exit 1;
fi

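# record the number of data columns found in each file (-1 if a file is internally inconsistent)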
i=1
for file in "$@"; do
    cols[$i]=$(awk '
BEGIN {cols=0}
$1 !~ /^#/ {
  if (cols==0) {cols=NF}
  else {
    if (cols!=NF) {cols=-1}
  }
}
END {print cols}
' "$file")
    i=$((${i}+1))
done

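# make sure every file reported the same number of columns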
ncol=${cols[1]}
for i in ${cols[@]}; do
    if [ $i -ne $ncol ]; then
        echo "mismatch in the number of columns"
        exit 1
    fi
done

echo "#combined $# files"
grep "^#A" $1

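# paste all the files side by side; awk then checks that the x-values agree and averages the y-columns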
paste "$@" | awk "
\$1 !~ /^#/ && NF>0 {
  flag=0
  x=\$1
  for (c=1; c<${ncol}; c++) { y[c]=0. }
  i=1
  while (i <= NF) {
    if (\$i==x) {
      for (c=1; c<${ncol}; c++) { y[c] += \$(i+c) }
      i+= ${ncol}
    } else { flag=1; i=NF+1; }
  }
  if (flag==0) {
    printf(\"%e \", x)
    for (c=1; c<${ncol}; c++) { printf(\"%e \", y[c]/$#) }
    printf(\"\n\")
  } else { printf(\"# x -coordinate mismatch\n\") }
}"

exit 0

My problem is that for a large number of data files it quickly becomes slow and at some point throws a "Too many open files" error. I see that pasting all data files in one go (paste "$@") is the issue here, but doing it in batches and somehow introducing temp files doesn't seem to be the ideal solution either. I'd appreciate any help to make this more scalable while retaining the way the script is called, i.e., with all data files passed as command-line arguments.

I decided to also post this under python since I am often told that it's very handy for dealing with this kind of problem. I have almost no experience with Python, however, but maybe this is the occasion to finally start learning it ;)

nobody2100
4 Answers


The code appended below works in Python 3.3 and produces the desired output, with a few minor caveats:

  • It grabs the initial comment line from the first file that it processes, but doesn't bother to check that all of the other ones after that still match (i.e., if you have several files that start with #A and one that starts with #C, it won't reject the #C, even though it probably should). I mainly wanted to illustrate how the merge function would work in Python, and figured that adding in this type of miscellaneous validity check is best left as a "homework" problem.
  • It also doesn't bother to check that the number of rows and columns match, and will likely crash if they don't. Consider it another minor homework problem.
  • It prints all columns to the right of the first one as float values, since in some cases, that's what they might be. The initial column is treated as a label or line number, and is therefore printed as an integer value.

You can call the code in almost the same way as before; e.g., if you name the script file merge.py, you can do python merge.py data0001.dat data0002.dat and it will print the merged average result to stdout just as with the bash script. The code also has added flexibility compared to one of the earlier answers: the way it's written, it should in principle (I haven't actually tested this to make sure) be able to merge files with any number of columns, not just files that have precisely three columns. Another nice benefit: it doesn't keep files open after it is done with them; the with open(name, 'r') as infile: line is a Python idiom that automatically closes the file once the script is finished reading from it, even though close() is never explicitly called.

#!/usr/bin/env python

import argparse
import re

# Give help description
parser = argparse.ArgumentParser(description='Merge some data files')
# Add to help description
parser.add_argument('fname', metavar='f', nargs='+',
                    help='Names of files to be merged')
# Parse the input arguments!
args = parser.parse_args()
argdct = vars(args)

topcomment=None
output = {}
# Loop over file names
for name in argdct['fname']:
    with open(name, "r") as infile:
        # Loop over lines in each file
        for line in infile:
            # Skip comment lines, except to take note of first one that
            # matches "#A"
            if re.search('^#', line):
                if re.search('^#A', line) is not None and topcomment is None:
                    topcomment = line
                continue
            items = line.split()
            # If a line matching this one has been encountered in a previous
            # file, add the column values
            currkey = float(items[0])
            if currkey in output:
                for ii in range(len(output[currkey])):
                    output[currkey][ii] += float(items[ii+1])
            # Otherwise, add a new key to the output and create the columns
            else:
                output[currkey] = list(map(float, items[1:]))

# Print the comment line
print(topcomment, end='')
# Get total number of files for calculating average
nfile = len(argdct['fname'])              
# Sort the output keys
skey = sorted(output.keys())
# Loop through sorted keys and print each averaged column to stdout
for key in skey:
    outline = str(int(key))
    for item in output[key]:
        outline += ' ' + str(item/nfile)
    outline += '\n'
    print(outline, end='')
stachyra

As a quick check on the number of file handles you have / are using, try this (on Unix):

cat /proc/sys/fs/file-nr

That will give you (number of allocated file handles) - (number of allocated but unused file handles) - (maximum number of file handles); see here.

It is possible to change the limits in sysctl.conf (on Linux; see the link above), but this is probably not a good idea resource-management-wise and therefore isn't really scalable. And yes, things start to get slower as more handles get used up opening each file (they don't get closed until the command finishes), and it eventually fails when there are no more handles available.

One potential solution could incorporate Python/SciPy/Pandas and a simple database. There's great documentation and large community support. An example that is closely related to your post is here. A small post on interfacing Pandas and a database is linked here.

I haven't tried this but I'll give it a shot:

For the database you could use something like the pandas io.sql module to create a useful representation of each .dat file (maybe using the #A header as an identifier for each table). Then the data can be manipulated by any number of methods, e.g. glued together. This does not retain the ./merge.sh data* > dataComb.dat functionality that you ask for, but a simple Python command-line script could potentially handle all the steps to get the data in and processed as you wish.
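To make that a bit more concrete, here is a rough, untested sketch of what I have in mind. It assumes whitespace-separated columns and dumps everything into a single SQLite table (which I'm calling runs here) instead of one table per #A identifier:

import glob
import sqlite3

import pandas as pd

conn = sqlite3.connect("runs.db")
conn.execute("DROP TABLE IF EXISTS runs")   # start fresh on every run

# Load every data file into one long SQLite table, skipping '#' comment lines.
for path in sorted(glob.glob("data*.dat")):
    df = pd.read_csv(path, sep=r"\s+", comment="#", header=None)
    df.columns = ["x"] + ["y%d" % i for i in range(1, df.shape[1])]
    df.to_sql("runs", conn, if_exists="append", index=False)

# Pull everything back out and average the y-columns for each x-value.
merged = pd.read_sql_query("SELECT * FROM runs", conn)
averaged = merged.groupby("x", sort=True).mean().reset_index()
print(averaged.to_string(index=False, header=False))

Note that this drops the #A header line (you'd have to carry it over separately), and the database file is only really worth the trouble if you also want to keep the per-run data around for other queries later.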

I think it would be quite a learning curve, but it could pay off in scalability/flexibility for the future.

hmcmurray

You seem to be struggling with the fact that too many files are open at once. You seem to have figured out how to handle the rest of the processing (i.e. sorting the files based on the unique IDs and accessing the values contained in a single .dat file), so I'll just focus on this issue.

When dealing with multiple sources, a common trick is to remember that you don't need to have all the values at once to compute the average. All you need is the sum and the number of values you have added.

I'm not familiar with awk syntax, so I'll write it in pseudo-code (a rough Python sketch follows the list below).

  1. Create a table sum matching your data structure. Let's say sum[x][y] holds the value at column x and line y. The table is initially filled with zeroes.
  2. Set a counter n = 0.
  3. Open the first file. I'll skip the processing part you seem to have handled, so let's say data contains the values you've extracted. Access works as described for sum.
  4. Add the values to your sum table: sum[x][y] += data[x][y] for every x and y value.
  5. Close the file.
  6. Increment the counter: n += 1.
  7. Repeat steps 3 to 6 until you've processed all the files.
  8. Compute the average: sum[x][y] = sum[x][y] / n for every x and y value.
  9. You got it! sum now contains the average you're looking for.
This algorithm handles any number of files and only one file is open at any given time.
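In Python, for instance, that loop could look something like the rough sketch below (untested, and it skips the column-count and x-value checks your bash script does):

#!/usr/bin/env python
import sys

sums = {}        # x-value (kept as a string) -> list of running column sums
header = None    # the '#A' comment line, taken from the first file seen
nfiles = 0

for name in sys.argv[1:]:
    with open(name) as infile:   # only one file is open at any given time
        for line in infile:
            if line.startswith('#'):
                if line.startswith('#A') and header is None:
                    header = line
                continue
            cols = line.split()
            if not cols:
                continue
            x, ys = cols[0], [float(v) for v in cols[1:]]
            if x in sums:
                for i, y in enumerate(ys):   # step 4: accumulate the sums
                    sums[x][i] += y
            else:
                sums[x] = ys
    nfiles += 1                              # step 6: count the files

if header is not None:
    print(header, end='')
for x, ys in sums.items():                   # step 8: divide by the file count
    print(x, *('%g' % (y / nfiles) for y in ys))

(This relies on dict insertion order, i.e. Python 3.7+, to keep the rows in the order of the first file; sort the keys if you need something stricter.)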

svvac

You can try this code; the main idea is to read the files iteratively and update a dictionary object with the count and sum of the second and third values in each row. Good luck!

#first you get the paths for all the dat files:

import os
dat_dir=r'c:\dat_dir'
our_files=[os.path.join(dat_dir,f) for f in os.listdir(dat_dir)]

#then you iterate over them and update a dictionary object with the results for each file:

dict_x_values={}
for f in our_files:
    fopen=open(f,'r')
    for line in fopen:
        line=line.strip('\n')
        split=[int(v) for v in line.split()]
        if len(split)==3:
            key=split[0]
            if key in dict_x_values:
                second_count,second_sum=dict_x_values[key][0] #the second value in the row
                second_count+=1 #we increment the count
                second_sum+=split[1] #we increment the sum
                third_count,third_sum=dict_x_values[key][1] #the third value in the row
                third_count+=1
                third_sum+=split[2]
                dict_x_values[key]=[[second_count,second_sum],[third_count,third_sum]]
            else:
                dict_x_values[key]=[[1,split[1]],[1,split[2]]] #if the dictionary doesn't have the left x-value, we initialize it
    fopen.close()


#Then we write our output combined file

comb_open=open('comb_dat.txt','w')


for key in sorted(dict_x_values): #keep the rows in order of the x-values
    second_count,second_sum=dict_x_values[key][0] #the second value in the row
    third_count,third_sum=dict_x_values[key][1] #the third value in the row
    second_avg=float(second_sum)/second_count
    third_avg=float(third_sum)/third_count
    line='%s\t%s\t%s'%(key,second_avg,third_avg)
    comb_open.write(line+'\n') #write the averaged row to the output file

comb_open.close()
hmghaly