
I have a dataset like this:

Time | Node Label | Values

At the same time there can be two rows with different values for the same node. I want to combine the values of these two rows into a new row that replaces the first one; the second row has to be deleted.

Example with only 2 values:

Time | Node Label | Values

 1        3      10     5
 1        5      15    11
 1        3      -6     7
 2        3       8     4
 2        5       3     9
 2        3       1     1

It becomes:

Time | Node Label | Values

 1        3       2      6
 1        5      15     11
 2        3       4.5    2.5
 2        5       3      9

In the end, for each time I need exactly one row per node label, with the rows sorted in ascending order. To combine the two rows into the new one I'm simply using np.mean.

I have come up with this solution:

import numpy as np

time_col = data[:, 0]
label_col = data[:, 1]

unique_labels, label_indices = np.unique(label_col, return_inverse=True)
unique_times, time_indices = np.unique(time_col, return_inverse=True)

# linear index of each row's (time, label) pair
grouped_indices = np.ravel_multi_index((time_indices, label_indices), dims=(len(unique_times), len(unique_labels)))

# collect the rows of every possible (time, label) group
grouped_data = [data[grouped_indices == i] for i in range(len(unique_times) * len(unique_labels))]

# average each group column-wise (a pair that never occurs yields an all-NaN row)
mean_rows = np.array([np.mean(group, 0) for group in grouped_data])

# the averaged rows become the new dataset
data = mean_rows

It works, but it's terribly slow, obviously because I have multiple explicit for loops and I'm certainly also looping over unnecessary elements. I can only use the numpy library.

For example, with this dataset it takes maybe hours: https://shorturl.at/myIY9

  • How are the values stored? What's the data type of that column? Why isn't it two separate columns? – Tim Roberts Mar 20 '23 at 17:15
  • Sorry for my poor formatting, I was trying to fix it. Thank you for your help. Data is a numpy array loaded from a .csv file. All the columns are imported as float numbers. The first two columns are respectively times and node labels. The remaining columns are nodal values – Alessio Mar 20 '23 at 17:20

1 Answer


First, you could try this solution.

After the groups are created, you can compute the mean of each group with an explicit for loop. That is still much better than evaluating [data[grouped_indices == i] for i in range(len(unique_times) * len(unique_labels))], which scans the whole array once for every possible (time, label) pair, including pairs that never occur.
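
For instance, here is a minimal sketch of that idea (assuming data is your array, with time, label and then the value columns): sort the rows by a combined (time, label) key, split at the group boundaries, and loop only over the groups that actually exist.

import numpy as np

# Assumed layout: data[:, 0] = time, data[:, 1] = node label, rest = values.
time_idx = np.unique(data[:, 0], return_inverse=True)[1]
label_idx = np.unique(data[:, 1], return_inverse=True)[1]
key = time_idx * (label_idx.max() + 1) + label_idx  # one integer per (time, label) pair

# Sort the rows so each group is contiguous, then split at the
# first occurrence of each key and average each piece column-wise.
order = np.argsort(key, kind="stable")
starts = np.unique(key[order], return_index=True)[1]
groups = np.split(data[order], starts[1:])
result = np.array([g.mean(axis=0) for g in groups])

The time and label columns survive the mean unchanged (they are constant within a group), and the result comes out sorted by time and then label, which is the order you asked for.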

Another solution, perhaps faster, would be to use numpy.bincount to group and also average the data.

Here's an example:

import numpy as np

# Create a sample 2D NumPy array: first column is the key, second the value
arr = np.array([[1, 3], [2, 4], [1, 5], [3, 5], [2, 6]])

# Get the unique keys and, for each row, the index of its group
unq, inv = np.unique(arr[:, 0], return_inverse=True)

# Per-group mean = weighted sum of the values / sum of the weights
# (uniform weights; the division by inv.size just keeps the sums small)
weights = np.ones_like(inv) / inv.size
sums = np.bincount(inv, weights=weights * arr[:, 1], minlength=len(unq))
counts = np.bincount(inv, weights=weights, minlength=len(unq))
mean_arr = np.column_stack((unq, sums / counts))

print(mean_arr)
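
For this sample the printed result should be one row per unique key, with the keys in ascending order:

[[1. 4.]
 [2. 5.]
 [3. 5.]]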

In the case of an array with multiple value columns, I think you can repeat the bincount for each column (without repeating the np.unique call), as sketched below.
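
Here is a hedged sketch of that on your original layout (again assuming data holds time, label and then the value columns); the only remaining Python loop is over the value columns:

import numpy as np

# Same combined (time, label) key as in the earlier sketch.
time_idx = np.unique(data[:, 0], return_inverse=True)[1]
label_idx = np.unique(data[:, 1], return_inverse=True)[1]
key = time_idx * (label_idx.max() + 1) + label_idx
_, first, inv = np.unique(key, return_index=True, return_inverse=True)

counts = np.bincount(inv)
# One bincount per value column, reusing the same group index.
means = np.column_stack([
    np.bincount(inv, weights=data[:, c]) / counts
    for c in range(2, data.shape[1])
])

# The first row of each group supplies its (time, label) pair;
# the groups come out sorted by time, then label.
result = np.hstack((data[first, :2], means))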

scarpma