
I've been wrapping my head around this problem for some days now:

  • I have several dataframes in csv files which all have the same format (same column names and row names), and I want to calculate the mean of each row of each file, then save the results in an output csv file. That works like this:
import glob
import csv

with open('result.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)

    for filename in glob.glob('F*.csv'):
        print(filename)
        with open(filename, newline='') as f_input:
            csv_input = csv.reader(f_input)
            header = next(csv_input)  # skip the header row
            averages = []  # one mean per column

            for col in zip(*csv_input):
                averages.append(sum(float(x) for x in col) / len(col))

        csv_output.writerow([filename] + averages)
  • But I actually need the mean of each column of each dataframe every 18 rows (because every row = 1 minute), excluding the header (because that's a string). So I've been trying to select only the first 18 rows with something like this in the for loop:

df = pd.read_csv(df, skiprows=0, nrows=18)  # to get only the first 18 minutes

But this is 1) not working (I think because it's not applicable to a reader) and 2) not ideal, because I actually just want to divide all the dfs into chunks of 18 rows and then calculate the mean of each column.

Any hint would be really helpful, thanks!

update

"Maybe you can show us a small sample dataframe (with 4-5 columns and rows) and show us what output you want (pretending you want it every 2 rows instead of every 18)."

This is my sample df for each file

      col1       col2       col3     col4        col5        col6
0   1.228516    61.228516   1.0     418.808284  957.419867  1025.554374
1   61.228516   121.228516  1.0     207.864712  694.581742  814.149004  
2   121.228516  181.228516  1.0     140.516367  370.653176  809.100268  
3   181.228516  241.228516  1.0     140.274964  448.755593  885.196647  
4   241.228516  301.228516  1.0     117.079110  312.173256  1907.873698

I want to calculate the mean of the first two rows (0, 1) for each column, say here for col6. So my output would be (1025.554374 + 814.149004) / 2 # definition of arithmetic mean.

I just need this in all my csv files but for every 18 rows
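Working out the expected value for col6 by hand from the sample above:

```python
# Arithmetic mean of col6 over rows 0 and 1 of the sample dataframe
mean_col6 = (1025.554374 + 814.149004) / 2
print(mean_col6)
```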

Hope it's clearer now, thanks!

Vera
  • Welcome to SO, @Vera! It sounds like you have two different problems: reading the files and calculating the means in chunks. What's going wrong when you're reading the files? – ASGM Apr 28 '20 at 13:17
  • Also, it's not clear whether you want the mean of each row ("I want to calculate the mean of each row"), each column ("I actually need the mean of each colum[n]"), or some version of either of those things every 18 rows. Maybe you can show us a small sample dataframe (with 4-5 columns and rows) and show us what output you want (pretending you want it every 2 rows instead of every 18). – ASGM Apr 28 '20 at 13:19
  • Hi @ASGM, thanks for your quick reply! Reading the files works fine in the for loop but I don't know how to integrate (and how to write) a function that splits every csv file in chunks of 18 rows. Like it would be ideal if I could do this when I am importing the files in the for loop. See my answer to your second question above – Vera Apr 28 '20 at 13:44
  • Thank you so much @ASGM your explanation solved the problem for me! I'm trying to add a new line of code that tells in the output which file the means belong to every 4 rows (0,1,2,3). – Vera Apr 28 '20 at 20:22

1 Answer


Rather than using a for loop, I'd recommend loading the whole dataframe and using groupby and floor division:

df.groupby(df.index // 18).mean()
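To see what this does, here is a minimal check using two columns of the sample dataframe from the question, grouped every 2 rows instead of every 18 (`df.index // 2` maps rows 0, 1 to group 0, rows 2, 3 to group 1, and row 4 to group 2):

```python
import pandas as pd

# Two columns from the sample dataframe in the question
df = pd.DataFrame({
    'col3': [1.0, 1.0, 1.0, 1.0, 1.0],
    'col6': [1025.554374, 814.149004, 809.100268, 885.196647, 1907.873698],
})
means = df.groupby(df.index // 2).mean()
print(means)
```

The col6 mean for group 0 is (1025.554374 + 814.149004) / 2, matching the expected output in the update.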

You don't need a for loop to do this to all of your files:

import glob
import pandas as pd

def group_mean(df):
    return df.groupby(df.index // 18).mean()

df = pd.concat([group_mean(pd.read_csv(fname)) for fname in glob.glob('F*.csv')])
df.to_csv('result.csv')

If you want to do it in a for loop for some reason:

dfs = []
for fname in glob.glob('F*.csv'):
    df = pd.read_csv(fname)
    dfs.append(df.groupby(df.index // 18).mean())
pd.concat(dfs).to_csv('result.csv')

If you really want to load a dataframe 18 rows at a time (which would only make sense if your data were really massive), you can use the chunksize parameter:

reader = pd.read_csv(filename, chunksize=18)  # filename = one of your input csvs
for chunk in reader:
    print(chunk.mean())
ASGM
  • Sorry, this might not have been clear in my post, but the 'result' file is actually already the result of the means of all columns (giving only one value as an output). So I need to chunk every file I import (above I'm doing it with glob). Updated the post above now with an example, hope it is clearer now – Vera Apr 28 '20 at 14:02
  • Yes, thank you so much! Now the thing is I would like to do it for all .csv files in one directory. Like I can write this now `data = pd.read_csv('FD_HRV_pp1.csv') data.head() datanew = data.groupby(data.index // 17).mean() datanew.head()` And then save it like `datanew.to_csv('FD_HRV_pp1_chunkmeans.csv')`. How can I create a for loop that does that for all files? ideal would be to have it like in the original version where I imported all csv files with glob and created a for loop – Vera Apr 28 '20 at 15:16
  • let me know if you need any other clarifications, thank you in advance! – Vera Apr 28 '20 at 16:38