averaging datasets of varying length

Question

I have a series of datasets output by a program. My goal is to plot an average of the datasets as a line graph using pyplot and numpy. My problem is that I cannot control the length of the datasets.

For example, I have four datasets with lengths varying between 200 and 400 points and x values normalised to the range 0 to 1, and I want to calculate the median line for the four datasets.

All I can think of so far is to interpolate (linearly would be sufficient) to add extra data points to the shorter sequences, or somehow periodically remove values from the longer sequences. Does anyone have any suggestions?

At the moment I import each file with csv.reader and append row by row to a list, so the result is a list of lists, each holding one pair of xy coordinates. I think that is the same as a 2D array?

I was actually thinking it may be easier to delete excess data points than to interpolate. For example, starting with four lists, I could remove unnecessary points along the x axis, since the x values are normalised and increasing, and then cull points with too small a step size by referencing the step sizes of the shortest list. (This explanation may not be so clear; I will try to write up an example and post it tomorrow.)

An example data set would be

line1=[[0.33,2],[0.66,5],[1,5]]

line2=[[0.25,43],[0.5,53],[0.75,6.5],[1,986]]

Example data? Desired results? What is the form of the datasets outputted from the program? — Alexander, Sep 03 '15 at 05:35
@Alexander It looks like everything is actually given, for one; the desired results stated as a 'median line for the four datasets' — Ross, Sep 03 '15 at 05:40
@Ross. Are you kidding? There is no data. Are the results lists, dictionaries, pandas series, etc.? This link is about how to ask a good Pandas question, but it is relevant here. http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — Alexander, Sep 03 '15 at 05:45
Sounds like a data analysis design issue and I think the answer depends on what your data is, what it means to interpolate it, etc. You might get good answers if you provide more detail about your data at the [Cross Validated](http://stats.stackexchange.com/) exchange. Once you know how you want to handle your data, sharing a minimal example of data and code here will make it easier for people to help you. For what it's worth, the only thing I could think of is to interpolate your data as you suggested. — KobeJohn, Sep 03 '15 at 05:46
@kobejohn - Agreed. And can OP tell us if you want to interpolate linearly or otherwise? — Ross, Sep 03 '15 at 05:53

score 0 · Accepted Answer · answered Sep 09 '15 at 06:52

The solution I used was to interpolate, as suggested above. A simplified version of the code is below.

First, the data is imported as a dictionary for ease of access and manipulation:

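def average(files, newfile):
    import csv
    dict = {}   # dataset name -> list of rows (note: shadows the built-in dict)
    ln = []     # dataset names, in the order they were read
    max = 0     # length of the longest dataset (note: shadows the built-in max)
    for x in files:
        # Python 3: csv wants text mode with newline=''; on Python 2 use open(x+'.csv', 'rb')
        with open(x+'.csv', newline='') as file:
            reader = csv.reader(file, delimiter=',')
            l = []
            for y in reader:
                l.append(y)
            dict[x] = l
            ln.append(x)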

Next, the length of the longest dataset is established:

    for y in ln:
        # keep the largest length seen so far
        if len(dict[y]) > max:
            max = len(dict[y])
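
As an aside, the same length can be found in one expression with the built-in max. This is only a minimal sketch, assuming the datasets sit in a dictionary of row lists like the one built above. The name `datasets` and its contents are made up for illustration, and because the function above uses `max` as a variable name, the built-in is not available there without a rename.

```python
# Hypothetical stand-in for the dictionary of imported rows.
datasets = {
    "line1": [[0.33, 2], [0.66, 5], [1, 5]],
    "line2": [[0.25, 43], [0.5, 53], [0.75, 6.5], [1, 986]],
}

# Length of the longest dataset, using the built-in max.
longest = max(len(rows) for rows in datasets.values())
print(longest)  # 4
```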

Next, the number of points that must be inserted into each dataset is computed:

    for y in ln:
        # number of rows this dataset is short of the longest one
        dif = max - len(dict[y])

Finally, the intermediate values are calculated by linear interpolation (the midpoint of two neighbouring rows) and inserted into the dataset:

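        for i in range(dif):
            # spread the insertion points along the dataset
            loc = int(i * len(dict[y]) / dif)
            if loc == 0:
                loc = 1
            # new row: column-wise midpoint of rows loc-1 and loc
            new = [(float(dict[y][loc-1][x]) + float(dict[y][loc][x])) / 2
                   for x in range(len(dict[y][loc]))]
            dict[y].insert(loc, new)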
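
To try the padding step on its own, the two loops can be pulled into a small helper. This is a sketch only: the name `pad_with_midpoints` is hypothetical, it assumes each dataset has at least two rows, and it uses the example `line1` from the question.

```python
def pad_with_midpoints(rows, target_len):
    """Return a copy of rows grown to target_len by inserting midpoint rows."""
    rows = [[float(v) for v in row] for row in rows]
    dif = target_len - len(rows)
    for i in range(dif):
        # Spread the insert positions along the (growing) list; never insert at 0.
        loc = max(1, int(i * len(rows) / dif))
        # Midpoint of the two neighbouring rows, computed column by column.
        new = [(a + b) / 2 for a, b in zip(rows[loc - 1], rows[loc])]
        rows.insert(loc, new)
    return rows

line1 = [[0.33, 2], [0.66, 5], [1, 5]]
padded = pad_with_midpoints(line1, 4)
print(len(padded))  # 4
```

Note that this only equalises the number of rows. The x values of the padded datasets are not guaranteed to line up with each other.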

Then taking the average is simple:

    avg = []
    for x in range(len(dict[ln[0]])):
        # average the x and y columns of row x over all datasets
        t = [sum(float(dict[u][x][0]) for u in ln) / len(ln),
             sum(float(dict[u][x][1]) for u in ln) / len(ln)]
        avg.append(t)
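
The same row-by-row average can be written with zip once every dataset has the same length. This is a rough sketch with made-up numbers; `datasets` here is a hypothetical list of already padded datasets, not the dictionary used above.

```python
# Hypothetical datasets, already padded to the same length.
datasets = [
    [[0.0, 1.0], [0.5, 2.0], [1.0, 3.0]],
    [[0.0, 3.0], [0.5, 4.0], [1.0, 5.0]],
]

avg = []
# zip(*datasets) yields the i-th row of every dataset together.
for rows in zip(*datasets):
    n = len(rows)
    avg.append([sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n])
print(avg)  # [[0.0, 2.0], [0.5, 3.0], [1.0, 4.0]]
```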

I'm not saying it's pretty code, but it does what I needed it to...
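
For comparison, here is a different technique from the one above: resample every dataset onto one shared x grid with numpy.interp, then take the mean or median at each grid point. It is a sketch under assumptions, not the code I used. The helper `resample` and the 5-point grid are made up for illustration, the data are the two example lines from the question, and x is assumed to be increasing within each dataset.

```python
import numpy as np

line1 = [[0.33, 2], [0.66, 5], [1, 5]]
line2 = [[0.25, 43], [0.5, 53], [0.75, 6.5], [1, 986]]

def resample(rows, grid):
    """Linearly interpolate one dataset's y values onto grid."""
    xs, ys = np.asarray(rows, dtype=float).T
    # np.interp needs increasing xs; outside the data range it holds the end values.
    return np.interp(grid, xs, ys)

# Shared x grid (assumption: 5 points; use as many as the plot needs).
grid = np.linspace(0.0, 1.0, 5)
stacked = np.vstack([resample(rows, grid) for rows in (line1, line2)])

mean_line = stacked.mean(axis=0)          # average line
median_line = np.median(stacked, axis=0)  # median line
```

Because every dataset is evaluated at the same x values, `grid` and `mean_line` (or `median_line`) can be passed straight to pyplot's plot function.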
