
I have a csv file that contains data in the following format.

Layer   relative_time   Ht    BSs    Vge    Temp    Message
57986   2:52:46       0.00m   87    15.4    None    CMSG 
20729   0:23:02       45.06m  82    11.6    None    BMSG 
20729   0:44:17       45.06m  81    11.6    None    AMSG 

I want to read in this CSV file and calculate the average BSs for every hour. My CSV file is quite large, about 2000 values, but the values are not evenly distributed across the hours. For example:

I have 237 samples from hour 3 and only 4 samples from hour 6. I should also mention that BSs can be collected from multiple sources; the value always ranges from 20 to 100. Because of this the result is skewed. For each hour I am calculating the sum of BSs for that hour divided by the number of samples in that hour. The primary purpose is to understand how BSs evolves over time.
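To make concrete what I am computing, here is a minimal sketch of the per-hour average with made-up (hour, BSs) pairs; my actual processing code is further below:

from collections import defaultdict

# Made-up (hour, BSs) samples, just to illustrate the per-hour average
samples = [(3, 85), (3, 90), (3, 70), (6, 60)]

totals = defaultdict(int)   # sum of BSs per hour
counts = defaultdict(int)   # number of samples per hour
for hour, bss in samples:
    totals[hour] += bss
    counts[hour] += 1

hourly_mean = {hour: totals[hour] / counts[hour] for hour in totals}
print(hourly_mean)  # {3: 81.67 (approx.), 6: 60.0}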

What is the common approach to this problem? Is this where people apply normalization? It would be great if someone could explain how to apply normalization in such a situation.

The code I am using for my processing is shown below; I believe it is correct.

import csv
import datetime
import pprint

#This 2x24 matrix records which hours were seen (row 0) and the number of values recorded in each hour (row 1)
hours_no_values = [[0 for i in range(24)] for j in range(2)]

#This 2x24 matrix records the hour (row 0) and the running BSs total, later converted to the mean per hour (row 1)
mean_bss_stats = [[0 for i in range(24)] for j in range(2)]


with open(PREFINAL_OUTPUT_FILE) as fin, open(FINAL_OUTPUT_FILE, "w",newline='') as f:
    reader = csv.reader(fin, delimiter=",")
    writer = csv.writer(f)
    header = next(reader)  # <--- Pop header out
    writer.writerow([header[0],header[1],header[2],header[3],header[4],header[5],header[6]]) # <--- Write header
    sortedlist = sorted(reader, key=lambda row: datetime.datetime.strptime(row[1],"%H:%M:%S"), reverse=True)
    print(sortedlist)
    for item in sortedlist:
        rel_time = datetime.datetime.strptime(item[1], "%H:%M:%S")
        if rel_time.hour not in hours_no_values[0]:
            print('item[6] {}'.format(item[6]))
            if 'MAN' in item[6]:
                print('Hour found {}'.format(rel_time.hour))
                hours_no_values[0][rel_time.hour] = rel_time.hour
                mean_bss_stats[0][rel_time.hour] = rel_time.hour

                mean_bss_stats[1][rel_time.hour] += int(item[3])

                hours_no_values[1][rel_time.hour] +=1
            else:
                pass
        else:
            if 'MAN' in item[6]:
                print('Hour Previous {}'.format(rel_time.hour))
                mean_bss_stats[1][rel_time.hour] += int(item[3])

                hours_no_values[1][rel_time.hour] +=1
            else:
                pass

    for i in range(0,24):
        if(hours_no_values[1][i] != 0):
            mean_bss_stats[1][i] = mean_bss_stats[1][i]/hours_no_values[1][i]    
        else:
            mean_bss_stats[1][i] = 0


    pprint.pprint('mean bss stats {} \n hour_no_values {} \n'.format(mean_bss_stats,hours_no_values))

The number of values per hour is as follows, for hours 0 through 23:

[31, 117, 85, 237, 3, 67, 11, 4, 57, 0, 5, 21, 2, 5, 10, 8, 29, 7, 14, 3, 1, 1, 0, 0]

1 Answer


You could do this with pandas, using groupby and aggregating the appropriate column:

import pandas as pd
import numpy as np
df = pd.read_csv("your_file")
df.groupby('hour')['BSs'].aggregate(np.mean)

If you don't have that column in the initial dataframe, you could add it:

df['hour'] = your_hour_data
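In your case the hour is not a separate column in the file, so one option (a sketch, assuming the relative_time column keeps the H:MM:SS format shown in the question and that the file is comma-separated) is to derive it from relative_time before grouping:

import pandas as pd

# Sketch: derive the hour from the relative_time column (H:MM:SS as in the
# sample data) and group on it. Column names are assumed to match the file.
df = pd.read_csv("your_file")
df['hour'] = df['relative_time'].str.split(':').str[0].astype(int)
print(df.groupby('hour')['BSs'].mean())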

numpy.mean - calculates the mean of the array.

Compute the arithmetic mean along the specified axis.

pandas.groupby

Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns

From pandas docs:

By “group by” we are referring to a process involving one or more of the following steps:

Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure

Aggregation: computing a summary statistic (or statistics) about each group.
Some examples:

Compute group sums or means
Compute group sizes / counts
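
For instance (a sketch, reusing the df and derived 'hour' column from above), you can compute the mean BSs and the sample count per hour in a single pass, which also makes the uneven distribution of samples visible:

# Mean BSs and number of samples for each hour in one groupby pass.
# Assumes df and the 'hour' column from the earlier snippets.
summary = df.groupby('hour')['BSs'].agg(['mean', 'count'])
print(summary)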
