0

I have a csv file containing timestamps and data. The timestamps are organized in 10 second interval.. I'm trying to find the average of each column and group by the hour. I'm new to Python. So far I've the following code, it executes without error but it generates a blank file. Any help is appreciated. Thanks.

import csv  
import sys  
import itertools  


with open('convertcsv.csv', "r") as input, open ('test1.csv',"w") as output:    
    reader = csv.reader(input)
    average = 0
    Sum = 0  
    row_count = 0  
    for row in input:  
        for column in row.split(','):  
            n = float(column[1])  
            Sum += n  
            row_count += 1  
    average = Sum / len(column[1])  
    print (average)  
    writer = csv.writer(output)  
    writer.writerow(average)  
Stefan
  • 41,759
  • 13
  • 76
  • 81

2 Answers2

1

I would recommend using pandas for this.

There is read_csv() and once you have your data in a DataFrame, you can use groupby and TimeGrouper (after setting your TimeStamp columns as index like so:

import pandas as pd
df = pd.read_csv(path)
df.set_index(time_stamp_column, inplace=True)
hourly_avg = df.groupby(pd.TimeGrouper('H')).mean()
hourly_avg.to_csv(path/to/save/file.csv)

Result should be a .csv file that with a column for each variable's average and a row for each hour in your file.

Stefan
  • 41,759
  • 13
  • 76
  • 81
  • Thank you for your reply. Since I'm very new to python, to write the file into a csv file am I using the correct manner to do so? Since it is a very large file, I want the output to another file rather than on the terminal. – user3264280 Nov 30 '15 at 18:17
  • Also for writing to csv, I would use the `pandas` library. Your code looks like the output file should only contain the average value for your `column[1]`? After following the steps in my example, you could do `hourly_avg.to_csv(path)` and that would be it. Since `pandas` has been developed for tasks like these and of course many others, it's often easier to go this way than using the standard library. For `writerow`, I believe it takes a `list` arguments, so you should try `.writerow([average])`. – Stefan Nov 30 '15 at 18:39
  • Thank you for your help. As I'm a beginner, I was trying to see if I atleast get the values of column[1] correctly. And if it does, I can add the other columns as well. – user3264280 Nov 30 '15 at 18:44
  • If you prefer using the standard `csv` functionality, just follow http://stackoverflow.com/questions/14134237/writing-data-from-a-python-list-to-csv-row-wise – Stefan Nov 30 '15 at 19:00
  • Hi, can you please explain on how to add the hourly_avg.to_csv(path)? I believe what I've understood and what I'm doing is not the correct way. df[hourly_avg].to_csv("C:/Anaconda3/convertcsv.csv") Also, if I've understood correctly, "df.groupby(pd.TimeGrouper('H').mean()" will calculate the mean and group by the Hour? – user3264280 Nov 30 '15 at 19:00
  • I've updated the answer - let me know if we can help further. – Stefan Nov 30 '15 at 19:04
  • thank you, I really appreciate al your help. hourly_avg.to_csv('file.csv') - gives me a syntax error. Also, if I try print(df.groupby(pd.TimeGrouper('H').avg()). This gives me a syntax error as well. What am I doing incorrectly? – user3264280 Nov 30 '15 at 19:29
  • Would help to see the exact error message, but from afar suggests that `hourly_avg` is not a `DataFrame`. What does it show if you `print(hourly_avg)` before `.to_csv`? Also see http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.to_csv.html – Stefan Nov 30 '15 at 19:33
  • print (hourly_avg) ^ SyntaxError: invalid syntax : This is what I get before .to_csv – user3264280 Nov 30 '15 at 19:38
  • try using `print hourly_avg` if on `python 2.x`, `print(hourly_avg)` is for `python3`. – Stefan Nov 30 '15 at 19:39
  • I'm using Python3 that's why I used the parantheses – user3264280 Nov 30 '15 at 19:44
  • I see. What does `print(df)` give you before you get to the `hourly_avg`? – Stefan Nov 30 '15 at 19:46
  • Thank you, there was a missing parantheses on the previous line. – user3264280 Nov 30 '15 at 19:47
  • I used the following code and it ran properly: def same_day(date_string): return datetime.strptime(date_string, "%Y-%m-%d %H:%M:%S").strftime('%H:%M') df = pd.read_csv('convertcsv.csv', parse_dates = True, index_col=0, usecols=[0,1,2,3,4,5,6,7], names = ['date', 'data1', 'data2', 'data3', 'data4', 'data5', 'data6', 'data7']) hourly_avg = df.groupby(pd.TimeGrouper('H')) hourly_avg.to_CSV('file.csv', cols =('timestamp','header','min','max','average')) - I'm not sure if this is the correct way to write to csv file. I get AttributeError:Cannot access callable attribute 'to_csv' – user3264280 Nov 30 '15 at 21:48
  • You still need to apply your aggregation function to `df.groupby(pd.TimeGrouper('H')) ` before assigning to `hourly_avg` and storing to csv. At this point, you only have a grouped `DataFrame` without any averages. Simply add `.mean()` to the end, ie, `df.groupby(pd.TimeGrouper('H')).mean()` – Stefan Nov 30 '15 at 21:52
  • Thank you very much. But it still gives me error AttributeError: 'TimeGrouper' object has no attribute 'mean' – user3264280 Nov 30 '15 at 22:08
  • Sorry just saw there was a parenthesis missing - should have been `hourly_avg = df.groupby(pd.TimeGrouper('H')).mean()`, answer updated. – Stefan Nov 30 '15 at 22:56
  • Thank you, it writes to a csv file now. i'm now trying to put the data in this format: timestamp, header, min, max, average. So far I've tried mh1 = {'m' :np.mean, 'h':np.min, 'l':np.max} hourly_avg.resample("60min", how = mh1) I'm not sure if what i'm doing is correct. Could you please guide me as to how can I group the data in this order to write to csv file? – user3264280 Nov 30 '15 at 23:20
  • As this is getting a bit long here and it seems like you've asked a second question on the to_csv part, I'll add my answer there. – Stefan Nov 30 '15 at 23:22
  • Thank you :) I've just edited the code to where I'm at right now, but I'm not sure if its the right path – user3264280 Nov 30 '15 at 23:26
  • hello, I was just wondering that I have CSV file and learnt something... Is it possible to read the same data from JSON File? I have started with something like df = pd.read_json('data.json', convert_dates = True, But i dont understand how to parse the columns? – user3264280 Dec 04 '15 at 02:32
0

I know pandas is extremely intimidating when you're just learning Python but trust me, it is the way to go instead of using the csv module. You can do a lot in a few lines of code vs csv module and for loops and defining variables manually.

Grayed out below is an example of how you'd read your data in, check datatypes, resample the data, and write to csv file.

The biggest problem you might run into is getting your datatypes (dtypes) set properly. For example, if you read your data in and check the datatypes, you might see this:

df.dtypes
Index     object
A        float64
B        float64
C        float64
D        float64
dtype: object

You first need the Index in a datetime datatype. To do this, do the following:

df['Index'] = pd.to_datetime(df['Index'])

Then check your datatypes again to confirm you've converted Index to a datetime datatype:

Index    datetime64[ns]
A               float64
B               float64
C               float64
D               float64
dtype: object

In order to resample in pandas, your index needs to be a DatetimeIndex. To set the index in a dataframe, use:

df = df.set_index('Index')

If your datatypes are now correct, you can perform the resample.

import pandas as pd
import numpy as np

#UNCOMMENT THE CODE PARTS BELOW IF DESIRED
## cp1252 encoding works best on my windows machine
#df = pd.read_csv('convertcsv.csv', encoding='cp1252')
## check datatypes to make sure they are not 'object' when it should be 'float64' or 'int64' for example
#print(df.dtypes)
## you want to group by hour and find the average (aka: mean) which is where resample comes in
## the 'H' means 'hours' and how='mean' is telling it what to do with the data after it groups by hour
#df = df.resample('H', how='mean')
## you want to write test1.csv . If you don't want the index, set index=False
#df.to_csv('test1.csv', index=False)

#Example
index = pd.date_range('1/1/2015', periods=6*60*3, freq='10S')
data = abs(np.random.randn(6*60*3, 4))
df = pd.DataFrame(data=data, index=index, columns=list('ABCD'))
df = df.resample('H', how='mean')
print(df)
Jarad
  • 17,409
  • 19
  • 95
  • 154
  • Thank you for your reply :) yes, after having some trouble i did change my index to df['Index'] = pd.to_datetime(df['Index']). – user3264280 Dec 04 '15 at 02:05
  • hello, I was just wondering that I have CSV file and learnt something... Is it possible to read the same data from JSON File? I have started with something like df = pd.read_json('data.json', convert_dates = True, But i dont understand how to parse the columns? – user3264280 Dec 04 '15 at 02:33
  • @user3264280 You're definitely on the right track. I don't work much with JSON file reading but [Andy Hayden](http://hayd.github.io/2013/pandas-json/) has a good example. I think you're trying to understand, when the `dataframe` is created, how do you narrow the `dataframe` down to the columns you want. The main way is to do `df = df[['Column 1', 'Something Else', 'Another column I want']]`. I would start with just `df = pd.read_json('data.json')` then see what's in the `df`. Chances are, Pandas parsed the columns for you. – Jarad Dec 04 '15 at 04:32
  • Thank you, yes, as reading from CSV was not too difficult and had lots of examples but reading from JSON using pandas is little difficult. I'm assuming it the way the parameters are passed, but I'm not quite getting how to do it. Following your lead, I tried to do that df = df[['time'], ['d1'], ['d2'], ['d3'], ['d4'], ['d5'], ['d6'], ['d7']] but then i get this errror TypeError: unhashable type: 'list' – user3264280 Dec 04 '15 at 04:38