
I have X and Y data, where each element of Y corresponds to the element of X at the same position (e.g. y[0] goes with x[0]), and a script that splits this data into 5 categories: low, medium, high, ultra and critical, and calculates the midpoint of each category (sum(category) / len(category) if len(category) else 0), generating a histogram. I would like to process this data in a time-aware way: Y would be split into lists covering 1-hour intervals (60 data points, following the principle that each data point corresponds to 1 minute), and from each of those 1-hour lists the categories low, medium, high, ultra and critical and their midpoints would be computed, so there would be the following:

Groups at 1-hour intervals, broken down by the frequencies of low, medium, high, ultra and critical, each with its respective midpoint.

As it stands, the script calculates over the data as a whole, without any temporal distinction. Any suggestions?
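Roughly the shape I'm after, as a minimal plain-Python sketch (illustrative only; it assumes Y is a flat list of one-per-minute values):

# chunk Y into one-hour windows of 60 points, then band each window and
# take the per-band midpoint (the medium band is shown; the others follow the same pattern)
hours = [Y[i:i + 60] for i in range(0, len(Y), 60)]
for chunk in hours:
    medium = [v for v in chunk if 60 <= v <= 69]
    pmMedium = sum(medium) / len(medium) if medium else 0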

Data:

Y = [68 68 69 68 69 70 71 75 72 73 72 72 73 72 72 73 70 71 73 72 72 71 69 68
     69 68 69 68 69 68 68 68 68 69 68 69 70 71 75 72 69 68 68 68 69 68 69 70
     71 75 72 69 68 69 68 69 68 69 68 68 60 60 61 60 61 65 69 69 72 73 72 72
     73 72 72 73 70 71 73 75 78 80 82 84 87 84 84 83 82 79 78 76 74 73 72 72
     72 71 75 72 69 68 68 68 69 68 69 70 71 75 72 69 68 69 68 69 68 69 68 68]

X =  ['2020-01-01 09:00:01' '2020-01-01 09:01:01' '2020-01-01 09:02:01'
      '2020-01-01 09:03:01' '2020-01-01 09:04:01' '2020-01-01 09:05:01'
      '2020-01-01 09:06:01' '2020-01-01 09:07:01' '2020-01-01 09:08:01'
      '2020-01-01 09:09:01' '2020-01-01 09:10:01' '2020-01-01 09:11:01'
      '2020-01-01 09:12:01' '2020-01-01 09:13:01' '2020-01-01 09:14:01'
      ...
      '2020-01-02 09:45:01' '2020-01-02 09:46:01' '2020-01-02 09:47:01'
      '2020-01-02 09:48:01' '2020-01-02 09:49:01' '2020-01-02 09:50:01'
      '2020-01-02 09:51:01' '2020-01-02 09:52:01' '2020-01-02 09:53:01'
      '2020-01-02 09:54:01' '2020-01-02 09:55:01' '2020-01-02 09:56:01'
      '2020-01-02 09:57:01' '2020-01-02 09:58:01' '2020-01-02 09:59:01']

Current code:

Low = []
Medium = []
High = []
Ultra = []
Critical = []

for val in Y:
    if val <= 59:
        Low.append(val)
    elif 60 <= val <= 69:
        Medium.append(val)
    elif 70 <= val <= 79:
        High.append(val)
    elif 80 <= val <= 89:
        Ultra.append(val)
    elif val >= 90:
        Critical.append(val)

pmLow = sum(Low) / len(Low) if len(Low) else 0
pmMedium = sum(Medium) / len(Medium) if len(Medium) else 0
pmHigh = sum(High) / len(High) if len(High) else 0
pmUltra = sum(Ultra) / len(Ultra) if len(Ultra) else 0
pmCritical = sum(Critical) / len(Critical) if len(Critical) else 0

sizeLow = len(Low)
sizeMedium = len(Medium)
sizeHigh = len(High)
sizeUltra = len(Ultra)
sizeCritical = len(Critical)

axisX = []
axisY = []
# list.append takes a single argument; extend with a list instead
axisX.extend([pmLow, pmMedium, pmHigh, pmUltra, pmCritical])
axisY.extend([sizeLow, sizeMedium, sizeHigh, sizeUltra, sizeCritical])
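
Presumably axisX and axisY then feed a bar chart along these lines (the plotting code isn't shown in the question, so matplotlib and the exact call are assumptions):

import matplotlib.pyplot as plt

# assumed plotting step: one bar per category, height = category size
plt.bar(range(len(axisY)), axisY, tick_label=['Low', 'Medium', 'High', 'Ultra', 'Critical'])
plt.show()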

Current output:

(screenshot: histogram of the five categories over the whole data set)

Adapting the fix to my data set:

import random
from datetime import datetime

import pandas as pd

data = pd.read_csv('dir/data.csv')

X = data.iloc[:, 0].values
Y = data.iloc[:, 1].values

df = pd.DataFrame({'Time': X, 'Value': Y})

# extend the series with random values so there are several full 1-hour intervals
time = datetime.strptime(df['Time'].loc[len(df) - 1], '%Y-%m-%d %H:%M:%S') + pd.Timedelta(minutes=1)
for i in range(1, 119):
    new_row = {'Time': time.strftime('%Y-%m-%d %H:%M:%S'),
               'Value': random.choice(df['Value'].tolist())}
    df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
    time += pd.Timedelta(minutes=1)  # advance to the next minute


def get_time_mask(start_time, df_time):
    # this covers a range of 61 values (i.e. 9:00 to 10:00, including 10:00);
    # if you don't want to include the on-the-hour end point, use minutes=59
    start_time = datetime.strptime(start_time, '%Y-%m-%d %H:%M:%S')
    end_time = start_time + pd.Timedelta(minutes=60)
    times = pd.to_datetime(df_time, format='%Y-%m-%d %H:%M:%S')
    return (times >= start_time) & (times <= end_time)
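
One extra note: parsing the Time column once up front avoids re-parsing it inside every get_time_mask call (this assumes the raw column is in '%Y-%m-%d %H:%M:%S' form; get_time_mask would then take start_time as a datetime rather than a string):

# convert once; get_time_mask can then compare datetimes directly
df['Time'] = pd.to_datetime(df['Time'], format='%Y-%m-%d %H:%M:%S')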

1 Answer

Here's an example in pandas. I don't know what your histogram is supposed to look like, but I can get you to the point where it sounds like you'll be able to create it yourself.

First, convert your data to a dictionary so pandas can load it into a data frame:

import random

import numpy as np
import pandas as pd

df = pd.DataFrame({
  'Time': ['2020-01-01 09:00:01', '2020-01-01 09:01:01', '2020-01-01 09:02:01',
           '2020-01-01 09:03:01', '2020-01-01 09:04:01', '2020-01-01 09:05:01',
           '2020-01-01 09:06:01', '2020-01-01 09:07:01', '2020-01-01 09:08:01',
           '2020-01-01 09:09:01', '2020-01-01 09:10:01', '2020-01-01 09:11:01',
           '2020-01-01 09:12:01', '2020-01-01 09:13:01', '2020-01-01 09:14:01'],
  'Value': [68, 68, 69, 68, 69, 70, 71, 75, 72, 73, 72, 72, 73, 72, 72]
})

# parse the timestamps so the Timedelta arithmetic below works on datetimes
df['Time'] = pd.to_datetime(df['Time'], format='%Y-%m-%d %H:%M:%S')

I only grabbed a few of your points, but I am going to extrapolate them outwards to demonstrate how you can create a list of data frames grouped by the hour:

# adding ~2 hr 46 min of random data so there are three 1-hour intervals in total
for i in range(1, 167):
  new_row = {'Time': df['Time'].iloc[-1] + pd.Timedelta(minutes=1),
             'Value': random.choice(df['Value'].tolist())}
  df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

These are the functions you'll want to use once your data is converted to pandas data frame:

def get_time_mask(start_time, df_time):
  # this covers a range of 61 values (i.e. 9:00 to 10:00, including 10:00);
  # if you don't want to include the on-the-hour end point, use minutes=59
  end_time = start_time + pd.Timedelta(minutes=60)
  return (df_time >= start_time) & (df_time <= end_time)

def group_dfs(df):
  dfs_grouped = []
  # if not including the on-the-hour point, change to 60
  while len(df) > 61:

    # select a chunk of the df from its starting point out to 60 minutes
    time_mask = get_time_mask(df.head(1)['Time'].values[0], df['Time'])

    # append that chunk to the list
    dfs_grouped.append(df[time_mask])

    # keep only the rows we did not grab
    df = df[~time_mask]

  # the while loop repeats until the data frame has fewer rows than one hour's worth;
  # grab the remainder, if any, and append it to the list
  if len(df) != 0:
    dfs_grouped.append(df)
  return dfs_grouped


def apply_category(val):
  categories = pd.Series(['low', 'medium', 'high', 'ultra', 'critical'])
  conditions = [val <= 59, 60 <= val <= 69, 70 <= val <= 79, 80 <= val <= 89, val >= 90]
  # boolean-index the category names with the condition list and take the single match
  return categories[conditions].values[0]
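
(Aside: the same banding can also be expressed declaratively with pd.cut; a minimal sketch with the same boundaries, where apply_categories_cut is just an illustrative name:)

def apply_categories_cut(values):
  # integer values land in (-inf, 59], (59, 69], (69, 79], (79, 89], (89, inf)
  bins = [-np.inf, 59, 69, 79, 89, np.inf]
  labels = ['low', 'medium', 'high', 'ultra', 'critical']
  return pd.cut(values, bins=bins, labels=labels)

With that, df['Category'] = apply_categories_cut(df['Value']) would replace the row-wise apply.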

def categorize(df):
  df['Category'] = df['Value'].apply(apply_category)
  return df

def get_midpoints(df):
  # select the Value column before aggregating so the datetime column isn't averaged
  return df.groupby('Category')['Value'].mean()

dfs_grouped = group_dfs(df)

# add the Category column to each grouped frame in place
for df in dfs_grouped:
  df['Category'] = df['Value'].apply(apply_category)

dfs_midpoint = [get_midpoints(df) for df in dfs_grouped]

Output:

dfs_grouped:

# list of len = 3, each is a df

[                  Time  Value
0  2020-01-01 09:00:01     68
1  2020-01-01 09:01:01     68
2  2020-01-01 09:02:01     69
3  2020-01-01 09:03:01     68
4  2020-01-01 09:04:01     69
..                 ...    ...
56 2020-01-01 09:56:01     71
57 2020-01-01 09:57:01     72
58 2020-01-01 09:58:01     73
59 2020-01-01 09:59:01     72
60 2020-01-01 10:00:01     73

[61 rows x 2 columns],                    Time  Value
61  2020-01-01 10:01:01     72
62  2020-01-01 10:02:01     73
63  2020-01-01 10:03:01     73
64  2020-01-01 10:04:01     70
65  2020-01-01 10:05:01     72
..                  ...    ...
117 2020-01-01 10:57:01     73
118 2020-01-01 10:58:01     72
119 2020-01-01 10:59:01     72
120 2020-01-01 11:00:01     72
121 2020-01-01 11:01:01     73

[61 rows x 2 columns],                    Time  Value
122 2020-01-01 11:02:01     69
123 2020-01-01 11:03:01     72
124 2020-01-01 11:04:01     72
125 2020-01-01 11:05:01     73
126 2020-01-01 11:06:01     73
127 2020-01-01 11:07:01     72
128 2020-01-01 11:08:01     72
129 2020-01-01 11:09:01     70
130 2020-01-01 11:10:01     72
131 2020-01-01 11:11:01     73
132 2020-01-01 11:12:01     73
133 2020-01-01 11:13:01     72
134 2020-01-01 11:14:01     73
135 2020-01-01 11:15:01     73
136 2020-01-01 11:16:01     72
137 2020-01-01 11:17:01     72
138 2020-01-01 11:18:01     72
139 2020-01-01 11:19:01     70
140 2020-01-01 11:20:01     73
141 2020-01-01 11:21:01     73
142 2020-01-01 11:22:01     72
143 2020-01-01 11:23:01     70
144 2020-01-01 11:24:01     72
145 2020-01-01 11:25:01     72
146 2020-01-01 11:26:01     72
147 2020-01-01 11:27:01     73
148 2020-01-01 11:28:01     72
149 2020-01-01 11:29:01     73
150 2020-01-01 11:30:01     73
151 2020-01-01 11:31:01     68
152 2020-01-01 11:32:01     70
153 2020-01-01 11:33:01     72
154 2020-01-01 11:34:01     72
155 2020-01-01 11:35:01     73
156 2020-01-01 11:36:01     72
157 2020-01-01 11:37:01     72
158 2020-01-01 11:38:01     72
159 2020-01-01 11:39:01     69
160 2020-01-01 11:40:01     72
161 2020-01-01 11:41:01     72
162 2020-01-01 11:42:01     72
163 2020-01-01 11:43:01     73
164 2020-01-01 11:44:01     73
165 2020-01-01 11:45:01     73
166 2020-01-01 11:46:01     72
167 2020-01-01 11:47:01     72
168 2020-01-01 11:48:01     72
169 2020-01-01 11:49:01     73
170 2020-01-01 11:50:01     70
171 2020-01-01 11:51:01     72
172 2020-01-01 11:52:01     75
173 2020-01-01 11:53:01     70
174 2020-01-01 11:54:01     72
175 2020-01-01 11:55:01     72
176 2020-01-01 11:56:01     72
177 2020-01-01 11:57:01     69
178 2020-01-01 11:58:01     72
179 2020-01-01 11:59:01     73
180 2020-01-01 12:00:01     73]

After the loop, dfs_grouped:

[                  Time  Value Category
0  2020-01-01 09:00:01     68   medium
1  2020-01-01 09:01:01     68   medium
2  2020-01-01 09:02:01     69   medium
3  2020-01-01 09:03:01     68   medium
4  2020-01-01 09:04:01     69   medium
..                 ...    ...      ...
56 2020-01-01 09:56:01     71     high
57 2020-01-01 09:57:01     72     high
58 2020-01-01 09:58:01     73     high
59 2020-01-01 09:59:01     72     high
60 2020-01-01 10:00:01     73     high

[61 rows x 3 columns],                    Time  Value Category
61  2020-01-01 10:01:01     72     high
62  2020-01-01 10:02:01     73     high
63  2020-01-01 10:03:01     73     high
64  2020-01-01 10:04:01     70     high
65  2020-01-01 10:05:01     72     high
..                  ...    ...      ...
117 2020-01-01 10:57:01     73     high
118 2020-01-01 10:58:01     72     high
119 2020-01-01 10:59:01     72     high
120 2020-01-01 11:00:01     72     high
121 2020-01-01 11:01:01     73     high

[61 rows x 3 columns],                    Time  Value Category
122 2020-01-01 11:02:01     69   medium
123 2020-01-01 11:03:01     72     high
124 2020-01-01 11:04:01     72     high
125 2020-01-01 11:05:01     73     high
126 2020-01-01 11:06:01     73     high
127 2020-01-01 11:07:01     72     high
128 2020-01-01 11:08:01     72     high
129 2020-01-01 11:09:01     70     high
130 2020-01-01 11:10:01     72     high
131 2020-01-01 11:11:01     73     high
132 2020-01-01 11:12:01     73     high
133 2020-01-01 11:13:01     72     high
134 2020-01-01 11:14:01     73     high
135 2020-01-01 11:15:01     73     high
136 2020-01-01 11:16:01     72     high
137 2020-01-01 11:17:01     72     high
138 2020-01-01 11:18:01     72     high
139 2020-01-01 11:19:01     70     high
140 2020-01-01 11:20:01     73     high
141 2020-01-01 11:21:01     73     high
142 2020-01-01 11:22:01     72     high
143 2020-01-01 11:23:01     70     high
144 2020-01-01 11:24:01     72     high
145 2020-01-01 11:25:01     72     high
146 2020-01-01 11:26:01     72     high
147 2020-01-01 11:27:01     73     high
148 2020-01-01 11:28:01     72     high
149 2020-01-01 11:29:01     73     high
150 2020-01-01 11:30:01     73     high
151 2020-01-01 11:31:01     68   medium
152 2020-01-01 11:32:01     70     high
153 2020-01-01 11:33:01     72     high
154 2020-01-01 11:34:01     72     high
155 2020-01-01 11:35:01     73     high
156 2020-01-01 11:36:01     72     high
157 2020-01-01 11:37:01     72     high
158 2020-01-01 11:38:01     72     high
159 2020-01-01 11:39:01     69   medium
160 2020-01-01 11:40:01     72     high
161 2020-01-01 11:41:01     72     high
162 2020-01-01 11:42:01     72     high
163 2020-01-01 11:43:01     73     high
164 2020-01-01 11:44:01     73     high
165 2020-01-01 11:45:01     73     high
166 2020-01-01 11:46:01     72     high
167 2020-01-01 11:47:01     72     high
168 2020-01-01 11:48:01     72     high
169 2020-01-01 11:49:01     73     high
170 2020-01-01 11:50:01     70     high
171 2020-01-01 11:51:01     72     high
172 2020-01-01 11:52:01     75     high
173 2020-01-01 11:53:01     70     high
174 2020-01-01 11:54:01     72     high
175 2020-01-01 11:55:01     72     high
176 2020-01-01 11:56:01     72     high
177 2020-01-01 11:57:01     69   medium
178 2020-01-01 11:58:01     72     high
179 2020-01-01 11:59:01     73     high
180 2020-01-01 12:00:01     73     high]

And dfs_midpoint:

[Category
high      72.142857
medium    68.400000
Name: Value, dtype: float64, Category
high      72.189655
medium    68.333333
Name: Value, dtype: float64, Category
high      72.163636
medium    68.750000
Name: Value, dtype: float64]

To access elements of dfs_midpoint, keep in mind it's a list of pandas Series, so if I wanted the value of high from the first group:

dfs_midpoint[0]['high']
> 72.14285714285714

# or, by position

dfs_midpoint[0].iloc[0]
> 72.14285714285714
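
As a closing note, once Time is a datetime column, the whole hourly summary can be condensed into a single groupby. A sketch (note this anchors windows to clock hours, so it is not byte-for-byte identical to group_dfs, which starts each window at the first remaining timestamp):

# bucket each row by its clock hour, then aggregate per hour and category
df['Hour'] = df['Time'].dt.floor('h')
summary = df.groupby(['Hour', 'Category'])['Value'].agg(['mean', 'size'])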