I have ~1.2k files that when converted into dataframes look like this:

df1
   A     B     C     D
0  0.1   0.5   0.2   C
1  0.0   0.0   0.8   C
2  0.5   0.1   0.1   H
3  0.4   0.5   0.1   H
4  0.0   0.0   0.8   C
5  0.1   0.5   0.2   C
6  0.1   0.5   0.2   C

Now, I have to subset each dataframe with a window of fixed size along the rows, and add its contents to a second dataframe whose values are all initialized to 0.

df_sum
   A    B    C   
0  0.0  0.0  0.0 
1  0.0  0.0  0.0 
2  0.0  0.0  0.0 

For example, let's set the window size to 3. The first subset therefore will be

window = df1.loc[start:end, 'A':'C']
window
   A     B     C 
0  0.1   0.5   0.2 
1  0.0   0.0   0.8 
2  0.5   0.1   0.1  

window.index = correct_index
df_sum = df_sum.add(window, fill_value=0)
df_sum
   A     B     C 
0  0.1   0.5   0.2 
1  0.0   0.0   0.8 
2  0.5   0.1   0.1 
 

After that, the window will be the subset of df1 from rows 1-3, then rows 2-4, 3-5, and finally 4-6. Once the first file has been scanned, the second file will begin, until all files have been processed. As you can see, this approach relies on df.loc for the subsetting and df.add for the addition. However, despite the ease of coding, it is very inefficient: on my machine it takes about 5 minutes to process the whole batch of 1.2k files of 200 lines each. I know that an implementation based on numpy arrays is orders of magnitude faster (about 10 seconds), but a bit more complicated in terms of subsetting and adding. Is there any way to increase the performance of this method while still using dataframes, for example by substituting loc with a faster slicing method?
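
For reference, a minimal sketch of the kind of NumPy-based implementation mentioned above might look like the following (it assumes the same tab-separated files with columns A-C and, like the expected output further below, sums only full windows):

import os

import numpy as np
import pandas as pd

window_size = 3
totals = np.zeros((window_size, 3))          # accumulator for columns A, B, C

for file in os.listdir('.'):
    values = pd.read_csv(file, sep='\t')[['A', 'B', 'C']].to_numpy()
    n_windows = len(values) - window_size + 1
    # row k of the accumulator receives the k-th row of every full window,
    # i.e. the sum of the file's rows k, k+1, ..., k+n_windows-1
    for k in range(window_size):
        totals[k] += values[k:k + n_windows].sum(axis=0)

df_sum = pd.DataFrame(totals, columns=['A', 'B', 'C'])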

Example:

def generate_index_list(window_size):
    # build the target index 0, 1, ..., window_size - 1 so that each window
    # lines up with df_sum's index before the add
    before_offset = -(window_size - 1) // 2
    after_offset = (window_size - 1) // 2
    index_list = list()
    for n in range(before_offset, after_offset + 1):
        index_list.append(n - before_offset)
    return index_list

import os
import pandas as pd

window_size = 3
df_sum = pd.DataFrame(0.0, index=range(window_size), columns=['A', 'B', 'C'])

for file in os.listdir('.'):
    df1 = pd.read_csv(file, sep='\t')
    starting_index = (window_size - 1) // 2
    before_offset = (window_size - 1) // 2
    after_offset = (window_size - 1) // 2
    indexes = generate_index_list(window_size)
    for index in df1.index:
        # skip rows whose window would run past either end of the file
        if index < starting_index or index + after_offset + 1 > len(df1.index):
            continue
        window = df1.loc[index - before_offset:index + after_offset, 'A':'C']
        window.index = indexes
        df_sum = df_sum.add(window, fill_value=0)

Expected output:

df_sum
   A    B    C   
0  1.0  1.1  2.0 
1  1.0  1.1  2.0 
2  1.1  1.6  1.4
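
For instance, row 0 of the expected df_sum is the sum of the first row of each of the five full windows (the windows starting at rows 0-4): A = 0.1 + 0.0 + 0.5 + 0.4 + 0.0 = 1.0, B = 0.5 + 0.0 + 0.1 + 0.5 + 0.0 = 1.1, C = 0.2 + 0.8 + 0.1 + 0.1 + 0.8 = 2.0.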
saiden
  • And what is `generate_index_list()`? For SO posts, we need a [mcve]. Also, if code works and you need advice on refactoring for efficiency, consider posting on [CodeReview StackExchange](https://codereview.stackexchange.com). – Parfait Oct 11 '20 at 16:18
  • It's just a simple function to set the correct indexes for the `add`. Thank you for your insights, I'm still new to posting questions on SO – saiden Oct 11 '20 at 16:28
  • Can you add the expected output to your question? – Code Different Oct 12 '20 at 02:44
  • Done, hope it's clearer now – saiden Oct 13 '20 at 10:10

1 Answer

Consider building a list of subsetted data frames with .loc and .head, then run a groupby aggregation after the individual window frames are concatenated.

import os
import pandas as pd

window_size = 3

def window_process(file):
    csv_df = pd.read_csv(file, sep='\t')

    window_dfs = [(csv_df.loc[i:, ['A', 'B', 'C']]    # ROW AND COLUMN SLICE
                         .head(window_size)           # SELECT FIRST WINDOW ROWS
                         .reset_index(drop=True)      # RESET INDEX TO 0, 1, 2, ...
                  ) for i in range(csv_df.shape[0])]

    sum_df = (pd.concat(window_dfs)                   # COMBINE WINDOW DFS
                .groupby(level=0).sum())              # AGGREGATE BY INDEX

    return sum_df

# BUILD LONG DF FROM ALL FILES
long_df = pd.concat([window_process(f) for f in os.listdir('.')])

# FINAL AGGREGATION
df_sum = long_df.groupby(level=0).sum()

Using the posted data sample, below are the individual data frames in window_dfs:

     A    B    C
0  0.1  0.5  0.2
1  0.0  0.0  0.8
2  0.5  0.1  0.1

     A    B    C
0  0.0  0.0  0.8
1  0.5  0.1  0.1
2  0.4  0.5  0.1

     A    B    C
0  0.5  0.1  0.1
1  0.4  0.5  0.1
2  0.0  0.0  0.8

     A    B    C
0  0.4  0.5  0.1
1  0.0  0.0  0.8
2  0.1  0.5  0.2

     A    B    C
0  0.0  0.0  0.8
1  0.1  0.5  0.2
2  0.1  0.5  0.2

     A    B    C
0  0.1  0.5  0.2
1  0.1  0.5  0.2

     A    B    C
0  0.1  0.5  0.2

With the final df_sum, equivalent to accumulating every window with DataFrame.add():

df_sum
     A    B    C
0  1.2  2.1  2.4
1  1.1  1.6  2.2
2  1.1  1.6  1.4
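
These totals include the two trailing partial windows shown above, which is why they differ from the expected output in the question. If only full windows should be counted, one possible tweak along the same lines (a sketch, not benchmarked) is to stop the comprehension window_size - 1 starts earlier:

window_dfs = [(csv_df.loc[i:, ['A', 'B', 'C']]
                     .head(window_size)
                     .reset_index(drop=True)
              ) for i in range(csv_df.shape[0] - window_size + 1)]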
Parfait
  • The proposed solution works a little bit better than mine, but is still quite slow. – saiden Oct 13 '20 at 09:55
  • I almost guessed that performance would change only marginally (though readability is improved). See updated approach. Possibly using `DataFrame.add` within a loop leads to [quadratic copying](https://stackoverflow.com/a/36489724/1422451) similar to `append` or `concat` in a `for` loop. The edit uses `concat` on list comprehensions and `groupby` aggregation. – Parfait Oct 13 '20 at 15:43