1

I want to calculate the average for each hours using a CSV file:

Below is my DATA SET:

Timestamp    Temperature
9/1/2016 0:00:08    53.8
9/1/2016 0:00:38    53.8
9/1/2016 0:01:08    53.8
9/1/2016 0:01:38    53.8
9/1/2016 0:02:08    53.8
9/1/2016 0:02:38    54.1
9/1/2016 0:03:08    54.1
9/1/2016 0:03:38    54.1
9/1/2016 0:04:38    54
9/1/2016 0:05:38    54
9/1/2016 0:06:08    54
9/1/2016 0:06:38    54
9/1/2016 0:07:08    54
9/1/2016 0:07:38    54
9/1/2016 0:08:08    54.1
9/1/2016 0:08:38    54.1
9/1/2016 0:09:38    54.1
9/1/2016 0:10:32    54
9/1/2016 0:11:02    54
9/1/2016 0:11:32    54
9/1/2016 0:00:08    54
9/2/2016 0:00:20    32
9/2/2016 0:00:50    32
9/2/2016 0:01:20    32
9/2/2016 0:01:50    32
9/2/2016 0:02:20    32
9/2/2016 0:02:50    32
9/2/2016 0:03:20    32
9/2/2016 0:03:50    32
9/2/2016 0:04:20    32
9/2/2016 0:04:50    32
9/2/2016 0:05:20    32
9/2/2016 0:05:50    32
9/2/2016 0:06:20    32
9/2/2016 0:06:50    32
9/2/2016 0:07:20    32
9/2/2016 0:07:50    32

Here is my code for calculating per day average, but I want per hour:

from datetime import datetime
import pandas
def same_day(date_string): # Remove year
return datetime.strptime(date_string, "%m/%d/%Y %H:%M%S").strftime(%m%d')

df = pandas.read_csv('/home/kk/Desktop/cal_Avg.csv',index_col=0,usecols=[0, 1], names=['Timestamp', 'Discharge'],converters={'Timestamp': same_day})

print(df.groupby(level=0).mean())

My desired output is like:

Timestamp              Temp          *        Avg
9/1/2016 0:00:08    53.8
9/1/2016 0:00:38    53.8    ?avg for this hour
9/1/2016 0:01:08    53.8
9/1/2016 0:01:38    53.8    ?avg for this hour
9/1/2016 0:02:08    53.8
9/1/2016 0:02:38    54.1

Now I want the average for specific hours , Min

Desired output:

Here I am printing only 5 hours output for date 01-09-2016 and 02-09-16

010900              54.362727         45.497273
010901              54.723276         45.068103
010902              54.746847         45.370270
010903              54.833913         44.931304
010904              54.971053         44.835088
010905              55.519444         44.459259
020901              31.742553         55.640426
020902              31.495556         55.655556
020903              31.304348         55.442609
020904              31.200000         55.437273
020905              31.294382         55.442697

Specific date and there specific hours? How do I archive this?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Neha B
  • 75
  • 3
  • 9

2 Answers2

1

I think you need first read_csv with parameters index_col=[0] for read first column to index and parse_dates=[0] for parse first column to DatetimeIndex:

df = pd.read_csv('filename', index_col=[0], parse_dates=[0],, usecols=[0,1])
print (df)
                     Temperature
Timestamp                       
2016-09-01 00:00:08         53.8
2016-09-01 00:00:38         53.8
2016-09-01 00:01:08         53.8
2016-09-01 00:01:38         53.8
2016-09-01 00:02:08         53.8
2016-09-01 00:02:38         54.1
2016-09-01 00:03:08         54.1
...
...

Then use resample by hours and aggregate Resampler.mean, but get NaN for missing data in DatetimeIndex:

print (df.resample('H').mean())
                     Temperature
Timestamp                       
2016-09-01 00:00:00    53.980952
2016-09-01 01:00:00          NaN
2016-09-01 02:00:00          NaN
2016-09-01 03:00:00          NaN
2016-09-01 04:00:00          NaN
2016-09-01 05:00:00          NaN
2016-09-01 06:00:00          NaN
2016-09-01 07:00:00          NaN
2016-09-01 08:00:00          NaN
2016-09-01 09:00:00          NaN
2016-09-01 10:00:00          NaN
2016-09-01 11:00:00          NaN
2016-09-01 12:00:00          NaN
2016-09-01 13:00:00          NaN
2016-09-01 14:00:00          NaN
2016-09-01 15:00:00          NaN
2016-09-01 16:00:00          NaN
2016-09-01 17:00:00          NaN
2016-09-01 18:00:00          NaN
2016-09-01 19:00:00          NaN
2016-09-01 20:00:00          NaN
2016-09-01 21:00:00          NaN
2016-09-01 22:00:00          NaN
2016-09-01 23:00:00          NaN
2016-09-02 00:00:00    32.000000

Another solution is remove minutes and seconds by casting to hours and groupby by this array:

print (df.index.values.astype('<M8[h]'))
['2016-09-01T00' '2016-09-01T00' '2016-09-01T00' '2016-09-01T00'
 '2016-09-01T00' '2016-09-01T00' '2016-09-01T00' '2016-09-01T00'
 '2016-09-01T00' '2016-09-01T00' '2016-09-01T00' '2016-09-01T00'
 '2016-09-01T00' '2016-09-01T00' '2016-09-01T00' '2016-09-01T00'
 '2016-09-01T00' '2016-09-01T00' '2016-09-01T00' '2016-09-01T00'
 '2016-09-01T00' '2016-09-02T00' '2016-09-02T00' '2016-09-02T00'
 '2016-09-02T00' '2016-09-02T00' '2016-09-02T00' '2016-09-02T00'
 '2016-09-02T00' '2016-09-02T00' '2016-09-02T00' '2016-09-02T00'
 '2016-09-02T00' '2016-09-02T00' '2016-09-02T00' '2016-09-02T00'
 '2016-09-02T00']

print (df.groupby([df.index.values.astype('<M8[h]')]).mean())
            Temperature
2016-09-01    53.980952
2016-09-02    32.000000

Also if need meean by months, days and hours is posible groupby by DatetimeIndex.strftime:

print (df.index.strftime('%m%d%H'))
['090100' '090100' '090100' '090100' '090100' '090100' '090100' '090100'
 '090100' '090100' '090100' '090100' '090100' '090100' '090100' '090100'
 '090100' '090100' '090100' '090100' '090100' '090200' '090200' '090200'
 '090200' '090200' '090200' '090200' '090200' '090200' '090200' '090200'
 '090200' '090200' '090200' '090200' '090200']

print (df.groupby([df.index.strftime('%m%d%H')]).mean())
        Temperature
090100    53.980952
090200    32.000000

Or if need mean only by hours groupby by DatetimeIndex.hour:

print (df.index.hour)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

print (df.groupby([df.index.hour]).mean())
   Temperature
0    44.475676
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • its calculating Average for per day ..i want for per hours? – Neha B Oct 26 '16 at 07:35
  • Sorry, it is for me a bit unclear. I try add multiple solution, if is missing what you need, what is desired output from your input? – jezrael Oct 26 '16 at 07:37
  • i want to calculate average for per hours like ; for this hour want Average: 2016-09-01 00:00:08 53.8 2016-09-01 00:00:38 53.8 for this hour want Average: 2016-09-01 00:02:08 53.8 2016-09-01 00:02:38 54.1 like this... – Neha B Oct 26 '16 at 07:40
  • Sorry, still unclear for me - you need groupby by `0` hour in all sample? Od by `month`, `day` and `hour` ? – jezrael Oct 26 '16 at 07:48
  • I think the best is use small sample of data like `df = pd.DataFrame({'Temperature': [1,2,3,4,5]}, index = pd.date_range(pd.to_datetime('2015-02-24 15:08:04'), periods=5, freq='13H'))` – jezrael Oct 26 '16 at 07:49
  • What is desired outpuf from my small sample? – jezrael Oct 26 '16 at 07:49
  • each hours average i want.. for this hour this is the Avg ,for next Hour this is the AVG – Neha B Oct 26 '16 at 08:54
  • Yes, what is output of my sample input? `df = pd.DataFrame({'Temperature': [1,2,3,4,5]}, index = pd.date_range(pd.to_datetime('2015-02-24 15:08:04'), periods=5, freq='13H'))` ? – jezrael Oct 26 '16 at 08:55
  • in your input one one time is there,,for 24-02-2015 15:08:04,if u give more time then u can get AVG,like if u give 2015-02-24 15:10:05 now we want to calculate this two time and give AVG for it. – Neha B Oct 26 '16 at 08:58
  • Ant this is wrong output? `print (df.groupby([df.index.values.astype(' – jezrael Oct 26 '16 at 09:03
  • this is giving correct output but for date 01-09-2016 but in this date so many hours are there i want per hour separately.See my desire output i edit my question.. – Neha B Oct 26 '16 at 09:10
  • Sorry, still unclear. If use `9/1/2016 1:20:38` then there is `1` hour ? Or there are multiple same times like `9/1/2016 0:00:38 53.8, 9/1/2016 0:00:38 53.8` and need mean of duplicate dates? – jezrael Oct 26 '16 at 09:15
  • will separate the data and than according to time can we calculate per hour Avg?? is it possible? – Neha B Oct 26 '16 at 09:17
  • yes i want per hours so if there is a 9/1/2016 0:00:38 53.8, 9/1/2016 0:00:38 53.8. so here we will calculate as 12am to 1am then 1am to 2 am like that – Neha B Oct 26 '16 at 09:18
  • So you need `print (df.resample('H').mean())` ? – jezrael Oct 26 '16 at 09:19
  • yes but this is also giving output like this only:(not separately ) 2016-09-01 53.99 2016-09-02 32.00 – Neha B Oct 26 '16 at 09:25
  • What mens separately? It means there are `NaN` ? Sorry, I want help you, but I need sample and desired output from sample. Please check [How to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and I try help you. But I still dont know what do you want exactly :( – jezrael Oct 26 '16 at 09:49
  • Thank you so much for your help..its working now.. here is one more question can we calculate this for multiple columns? like temperature is there,one more temp2,temp3...so on – Neha B Oct 26 '16 at 09:49
  • Do you use resample solution? – jezrael Oct 26 '16 at 09:50
  • I check my all solutions and I think it can works witjh multiple columns - If need mean in all columns also. – jezrael Oct 26 '16 at 09:52
  • yes i use Resample solution ..Okay i ll check with the multiple columns. i can export my result in some other csv ? – Neha B Oct 26 '16 at 09:55
  • Yes, use [`to_csv`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html) – jezrael Oct 26 '16 at 09:58
  • Yes, I think you need `df.reset_index().to_csv('myfile.csv')` – jezrael Oct 26 '16 at 10:46
  • and what about if i give i want to see only for 5 hours than how can we do? – Neha B Oct 26 '16 at 11:04
  • I think you need `DatatimeIndex` (print df.index) and then use http://pandas.pydata.org/pandas-docs/stable/timeseries.html#datetimeindex-partial-string-indexing – jezrael Oct 26 '16 at 11:12
  • Sure, but I dont know exactly what you think - resample by `5 hours` or group data by `5 hour` and aggreagate mean or select data from `2016-09-01 01:00:00 to 2016-09-01 06:00:00`. Maybe the best is create new question. Small advice - check [How to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) it is really good advice. Thanks. – jezrael Oct 27 '16 at 06:44
  • thanks..actually i want to give manually if i want for 10 hrs 5 hrs,,or 5 min like that .. – Neha B Oct 27 '16 at 06:45
  • and how can i export my result in a json array format? – Neha B Oct 27 '16 at 07:31
  • check - http://pandas.pydata.org/pandas-docs/stable/io.html#io-json-writer and [`to_json`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html) – jezrael Oct 27 '16 at 08:16
  • try `df.to_json('myfile.json')` – jezrael Oct 27 '16 at 09:08
  • ValueError: No columns to parse from file this error is coimg – Neha B Oct 27 '16 at 09:10
  • Try `df.reset_index().to_json('myfile.json')` – jezrael Oct 27 '16 at 09:11
  • What is `df.info()` ? – jezrael Oct 27 '16 at 09:14
  • No. I am not get it. – jezrael Oct 27 '16 at 09:29
  • rng = date_range('2016-09-01', periods=5, freq='H') i am using this but where should i mention data_range?? – Neha B Oct 27 '16 at 09:43
  • IIUC try `df.loc[rng]` – jezrael Oct 27 '16 at 09:46
  • There is problem I have no your data. Selecting depends on many things, so it is imposible help you. Unfortunately. – jezrael Oct 27 '16 at 10:08
  • same as my question ..according to my dataset which i given above.. like that only; IF i give 9/1/2016 this date and want to see only 5 hrs data AVG or 10 min data AVg ...like that – Neha B Oct 27 '16 at 10:12
  • hmmm, maybe need `print (df.resample('5H').mean())` or `print (df.resample('10t').mean())` – jezrael Oct 27 '16 at 10:25
  • its giving 5 hours AVG but its giving like combined all dates and than calculate avg..I want to calculate for this day give me only 5H avg..per hours individually – Neha B Oct 27 '16 at 11:05
  • Unfortunately I dont understand without sample and desired output. – jezrael Oct 27 '16 at 11:10
  • Yes, now it is more clear. So use [`boolean indexing`](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing) - `mask = pd.Series((df.index.minute < 6) & (df.index.hour < 6), index=df.index)` and then `print (df[mask])` - it filter dates less as 6 hour and less as 6 minutes. – jezrael Oct 27 '16 at 11:38
  • hours are coming properly but by minute its not coming properly suppose, whatever minutes are less than 1 min its coming like 01:00:23 like that... i want only less than 1 min its taking hours also.. – Neha B Oct 27 '16 at 12:35
  • Hmmm, you need only from `01:01:23` to `01:05:23` minutes? `01:00:23` is `Hours:Minutes:Seconds` Then use `mask = pd.Series((df.index.minute.isin([1,2,3,4,5]) ) , index=df.index)` – jezrael Oct 27 '16 at 12:38
  • It was untested, sorry. Use `mask = pd.Series(df.index.minute, index=df.index).isin([1,2,3,4,5])` – jezrael Oct 27 '16 at 12:43
  • still hours are also coming – Neha B Oct 27 '16 at 12:45
  • So need this `mask = pd.Series(df.index.minute, index=df.index).isin([1,2,3,4,5]) & pd.Series((df.index.hour < 6), index=df.index)` ? – jezrael Oct 27 '16 at 12:47
  • And actually what it is printing i want for 5 hours what is the average..like that for 10 min what is the average...Its giving separately .. actually i want for 5 hours the is the Average.. for 10 min this is the average... – Neha B Oct 27 '16 at 12:49
  • I dont understand. – jezrael Oct 27 '16 at 12:51
  • Sorry, Neha B - I am one idea - you can send me email (in my profile) with your sample data and desired output. Then can help you. I still dont understand what you need. thank you. – jezrael Oct 27 '16 at 12:53
  • see; if i say its coming less then 2 min data..So it is coming separately for 01:08 min its giving avg, BUT i want less then 2 min this is the AVG for 1 parameter and this is for 2nd parameter accordind to date 2016-09-01 00:01:08 53.8 45.8 2016-09-01 00:01:38 53.8 45.8 2016-09-01 00:02:08 53.8 45.8 2016-09-01 00:02:38 54.1 45.8 – Neha B Oct 27 '16 at 12:55
  • for 01-09-2016 what is the Avg for 10 min,,,for 02-09-2016 what is the Avg for 10 min like that – Neha B Oct 27 '16 at 12:57
  • what is your mail id so its helpful for me . – Neha B Oct 28 '16 at 04:46
0

I would first define a new column hour for readability, then groupBy it

df = pd.DataFrame.from_csv('/home/kk/Desktop/cal_Avg.csv',index_col=None)
df['hour']=df['Timestamp'].apply(lambda s:s[:-3])
df[['hour','Temprature']].groupBy('hour').mean()
Uri Goren
  • 13,386
  • 6
  • 58
  • 110
  • how to find missing timestamp..i want to find missing data ..my data is in every 30 second data and if some data is missing..how to find those data? – Neha B Nov 03 '16 at 11:29