I have a dataset like this:
Time | Node Label | Values
The same node can appear in two rows with different values at the same time. I want to combine the values of these two rows into a single new row that replaces the first one, and delete the second row.
Example with only 2 values:
Time | Node Label | Values
1 | 3 | 10 5
1 | 5 | 15 11
1 | 3 | -6 7
2 | 3 | 8 4
2 | 5 | 3 9
2 | 3 | 1 1
It becomes:
Time | Node Label | Values
1 | 3 | 2 6
1 | 5 | 15 11
2 | 3 | 4.5 2.5
2 | 5 | 3 9
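For reference, here is the same example written as NumPy arrays so it can be pasted and run. The names `example` and `expected` are placeholders used only for this illustration, and the float dtype is an assumption; the last lines check the arithmetic for the duplicated node at time 1.

```python
import numpy as np

# Columns: time, node label, value_1, value_2
example = np.array([
    [1, 3, 10, 5],
    [1, 5, 15, 11],
    [1, 3, -6, 7],
    [2, 3, 8, 4],
    [2, 5, 3, 9],
    [2, 3, 1, 1],
], dtype=float)

# Wanted result: one row per (time, node label) pair
expected = np.array([
    [1, 3, 2.0, 6.0],
    [1, 5, 15.0, 11.0],
    [2, 3, 4.5, 2.5],
    [2, 5, 3.0, 9.0],
])

# Rows 0 and 2 share time 1 and label 3; their column-wise mean is the new row
merged = np.mean(example[[0, 2]], axis=0)
print(merged)  # [1. 3. 2. 6.]
```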
In the end, for each time I need exactly one row per unique node label, with the rows sorted by node label in ascending order. To combine the duplicate rows into the new row I simply use the np.mean function.
I have come up with this solution:
import numpy as np

# data: 2-D array with columns [time, node label, value_1, value_2, ...]
time_col = data[:, 0]
label_col = data[:, 1]
# map every time and every label to an integer index
unique_labels, label_indices = np.unique(label_col, return_inverse=True)
unique_times, time_indices = np.unique(time_col, return_inverse=True)
# combine both indices into a single group id per (time, label) pair
grouped_indices = np.ravel_multi_index((time_indices, label_indices), dims=(len(unique_times), len(unique_labels)))
# build one boolean mask over the whole array for every possible pair
grouped_data = [data[grouped_indices == i] for i in range(len(unique_times) * len(unique_labels))]
# average each group column-wise; skip pairs that never occur (their mean would be NaN)
highest_value = np.array([np.mean(group, 0) for group in grouped_data if len(group)])
# rebuild data from the key columns and the averaged value columns
data = np.concatenate([highest_value[:, :2], highest_value[:, 2:]], axis=1)
It works, but it is terribly slow, obviously because I have multiple explicit for loops, and I am certainly also looping through unnecessary elements. I can only use the numpy library.
For example, with this dataset it takes maybe hours: https://shorturl.at/myIY9
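One loop-free direction, given only as a sketch: sort the rows by (time, label) with np.lexsort, mark where the pair changes, and sum each run of rows with np.add.reduceat before dividing by the run length. The helper name `merge_duplicates` is hypothetical, and the sketch assumes `data` is a non-empty 2-D float array whose first two columns are time and node label. It has not been timed on the linked dataset.

```python
import numpy as np

def merge_duplicates(data):
    """Average rows that share the same (time, label) pair (hypothetical helper)."""
    # lexsort treats the LAST key as primary: sort by time, then by label
    order = np.lexsort((data[:, 1], data[:, 0]))
    sorted_data = data[order]
    keys = sorted_data[:, :2]
    # A new group starts at row 0 and wherever the (time, label) pair changes
    is_start = np.ones(len(sorted_data), dtype=bool)
    is_start[1:] = np.any(keys[1:] != keys[:-1], axis=1)
    starts = np.flatnonzero(is_start)
    # Sum the value columns of each group in one call, then divide by group size
    sums = np.add.reduceat(sorted_data[:, 2:], starts, axis=0)
    counts = np.diff(np.append(starts, len(sorted_data)))
    means = sums / counts[:, None]
    # Put the key columns back in front of the averaged values
    return np.column_stack((keys[starts], means))

# Small example from the question (columns: time, label, two values)
example = np.array([
    [1, 3, 10, 5],
    [1, 5, 15, 11],
    [1, 3, -6, 7],
    [2, 3, 8, 4],
    [2, 5, 3, 9],
    [2, 3, 1, 1],
], dtype=float)
result = merge_duplicates(example)
```

On the small example this returns the four merged rows shown above, ordered by time and then by node label. Because the only Python-level work is a handful of array calls, the cost should be dominated by the sort rather than by one mask per (time, label) pair.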