I would like to find the indices at which several input values are matched in corresponding arrays. As an example, consider a time-series, for which a dataset contains multiple arrays: `years`, `months`, `days`, and `hours`. The values of the arrays are filled chronologically. Since the dataset is collected over the span of a few years, the `years` array will be sorted but the remaining arrays will not be (since the values in `hours` will only be sorted from `0-23` per day per month per year). Even though this dataset is collected over a span of several years, the dataset is not necessarily continuous, meaning that the number of days or hours between consecutive observations (that is, between the values at consecutive indices) can be greater than one, though it is not always.
import numpy as np
years = np.array([2017, 2017, 2018, 2018, 2018, 2018])
months = np.array([12, 12, 1, 1, 1, 2]) # 1-12 months in the year
days = np.array([31, 31, 1, 2, 18, 1]) # 28 (or 29), 30, or 31 days per month
hours = np.array([4, 2, 17, 12, 3, 15]) # 0-23 hours per day
def get_matching_time_index(yy, mm, dd, hh):
    """ This function returns an array of indices at which all values are matched in their corresponding arrays. """
    res, = np.where((years == yy) & (months == mm) & (days == dd) & (hours == hh))
    return res
idx_one = get_matching_time_index(2018, 1, 1, 17)
# >> [2]
idx_two = get_matching_time_index(2018, 2, 2, 0)
# >> []
`idx_one = [2]` since index `2` of `years` is `2018`, index `2` of `months` is `1`, index `2` of `days` is `1`, and index `2` of `hours` is `17`. Since `idx_two` came up empty, I would like to expand my search range to find the index that corresponds to the next nearest time. Since the values at the last index of each array are nearest to the corresponding input datetime parameters, I would like the last index of these arrays to be returned (`5` in this case).
One might be inclined to think that it's impossible to find the nearest group of values in multiple arrays. But in this case, the hours take precedence over the days, which take precedence over the months, etc. (since an observation 3 hours off from the input time is nearer in time than an observation 3 days off from the input time).
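To make the ordering of the units concrete, here is a minimal sketch using `np.lexsort`, which sorts on several keys at once and treats the last key as the primary one. It reuses the sample arrays from above; the name `order` is only illustrative. It shows that, taken together, the four arrays define a single chronological ordering even though only `years` is sorted on its own.

```python
import numpy as np

years = np.array([2017, 2017, 2018, 2018, 2018, 2018])
months = np.array([12, 12, 1, 1, 1, 2])
days = np.array([31, 31, 1, 2, 18, 1])
hours = np.array([4, 2, 17, 12, 3, 15])

# np.lexsort uses the LAST key as the primary sort key, so this orders
# by year first, then month, then day, and finally hour.
order = np.lexsort((hours, days, months, years))
print(order)  # [1 0 2 3 4 5]
```

Note that indices 0 and 1 swap: both fall on 2017-12-31, but hour 2 comes before hour 4, so the sample data is not strictly chronological at the hour level.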
I found a lot of nifty solutions that will work on one array via this post on StackOverflow, but not for a condition that works on multiple arrays. Furthermore, the most efficient solutions posted assume that the array is sorted, whereas the only sorted array in the case of my example is the years.
I suppose I can repeat the procedure suggested in that post on each of the arrays; this way, I can find the indices that are common to all of the arrays. Then, one can take the difference between the input time parameters and the time parameters found at the common indices. Starting from the arrays of smaller units (`hours` in this case), one can pick the index that corresponds to the smallest difference. But I feel that there is a simpler approach that may also be more efficient.
How can I better approach this problem to find the index that corresponds to the nearest grouping of data points via multiple arrays? Is this where a multi-dimensional array becomes handy?
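As one possible reading of the multi-dimensional idea, the four arrays can be stacked into a single 2D array with one row per observation, so that an exact match becomes a row comparison. This is only a sketch: `table` and `get_matching_row_index` are hypothetical names, and it handles exact matches only, not the nearest-time search.

```python
import numpy as np

years = np.array([2017, 2017, 2018, 2018, 2018, 2018])
months = np.array([12, 12, 1, 1, 1, 2])
days = np.array([31, 31, 1, 2, 18, 1])
hours = np.array([4, 2, 17, 12, 3, 15])

# One row per observation, columns are (year, month, day, hour).
table = np.column_stack((years, months, days, hours))

def get_matching_row_index(yy, mm, dd, hh):
    """Return the indices of rows whose four columns all equal the query."""
    query = np.array([yy, mm, dd, hh])
    return np.flatnonzero((table == query).all(axis=1))

print(get_matching_row_index(2018, 1, 1, 17))  # [2]
print(get_matching_row_index(2018, 2, 2, 0))   # []
```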
EDIT: On second thought, one can convert all time parameters into elapsed hours. Then, one can find the index corresponding to the observation that is nearest in elapsed hours. Regardless, I am still curious about various ways of approaching this problem.
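A minimal sketch of the elapsed-hours idea, assuming NumPy's `datetime64` type is acceptable for the conversion. The helper names `to_elapsed_hours` and `get_nearest_time_index` are hypothetical, and the choice of 1970-01-01 as the reference point is just NumPy's epoch, not anything required by the problem.

```python
import numpy as np

years = np.array([2017, 2017, 2018, 2018, 2018, 2018])
months = np.array([12, 12, 1, 1, 1, 2])
days = np.array([31, 31, 1, 2, 18, 1])
hours = np.array([4, 2, 17, 12, 3, 15])

def to_elapsed_hours(yy, mm, dd, hh):
    """Convert calendar components to integer hours since 1970-01-01T00."""
    yy, mm, dd, hh = (np.atleast_1d(a).astype(np.int64) for a in (yy, mm, dd, hh))
    # Months since the epoch -> first day of that month.
    first_of_month = ((yy - 1970) * 12 + (mm - 1)).astype('datetime64[M]').astype('datetime64[D]')
    # Add the day offset, switch to hour resolution, then add the hours.
    stamps = (first_of_month + (dd - 1).astype('timedelta64[D]')).astype('datetime64[h]')
    stamps = stamps + hh.astype('timedelta64[h]')
    return stamps.astype(np.int64)

def get_nearest_time_index(yy, mm, dd, hh):
    """Return the index of the observation nearest in time to the input."""
    elapsed = to_elapsed_hours(years, months, days, hours)
    target = to_elapsed_hours(yy, mm, dd, hh)[0]
    return int(np.argmin(np.abs(elapsed - target)))

print(get_nearest_time_index(2018, 1, 1, 17))  # 2 (exact match)
print(get_nearest_time_index(2018, 2, 2, 0))   # 5 (9 hours away)
```

Two caveats with this sketch: `np.argmin` returns the first index when two observations are equally close, and an impossible date such as February 31 is not rejected but silently rolls over into the following month.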