
I have a set of 100 data files containing information about particles (ID, velocity, position, etc.). I need to pick out 10,000 specific particles having certain ID numbers from each of them. The way I am doing it is as follows:

import numpy as np

identity, x, y, z = [], [], [], []

for i in range(n_files + 1):
    # Load one snapshot; this provides the particleID_in_data, ID,
    # x_component, y_component and z_component arrays.
    data = load_data_file(i, datatype="double_precision")
    for j in chosen_id_arr:
        # Scan the full ID array for each chosen ID.
        my_index = np.where(particleID_in_data == j)
        identity.append(ID[my_index])
        x.append(x_component[my_index])
        y.append(y_component[my_index])
        z.append(z_component[my_index])


The list `chosen_id_arr` contains all such IDs. The arrays in each data file are aligned by index, so the entries of the ID, x, y and z arrays at the same position belong to the same particle.

This snippet runs very slowly, and I was looking for a faster, more efficient alternative. Thank you very much in advance. :)

  • Does this answer your question? [faster alternative to numpy.where?](https://stackoverflow.com/questions/33281957/faster-alternative-to-numpy-where) – imbr Jun 18 '20 at 15:56
  • You need to tell a bit more about the structure of the data. Are the IDs positive integers and unique in each file? How many entries are there in `chosen_id_arr` and in `particleID_in_data`? Please provide an example dataset (use `np.random.randint` and/or `np.random.shuffle`). – Han-Kwang Nienhuys Jun 18 '20 at 20:09
  • @eusoubrasileiro I have gone through that post, but no, that doesn't help in my case. – noobprogrammer Jun 19 '20 at 06:22
  • @Han-KwangNienhuys So, the IDs are unique for each particle and are positive integers. Consider four separate arrays in each data file: the first array is the ID, and the other three are the x, y, z values. Suppose I want the information about the particle having ID 10. The way to do it is: (a) look for the ID number 10 in the ID array and get the index; (b) look at the x, y, z arrays at that index to get the values for that particle. I do that over each data file to track particle 10 over time (the data files are saved at different time intervals). – noobprogrammer Jun 19 '20 at 06:26
  • Are you able/allowed to change the data type stored in the files? For example one could think of a sparse array with the particle ID as index, such that there is no real search required. – David Wierichs Jun 19 '20 at 23:51
  • @DavidWierichs Yes, technically I can change the datatype. But then how would I look up the x, y, z values without the index of the ID? – noobprogrammer Jun 20 '20 at 05:44
  • For a sparse array, you then would just have `x, y, z = data[ID]`, right? Or for a dictionary, by storing them in the value of the ID, which would be the key: `data = {id_i: [x_i, y_i, z_i]}` – David Wierichs Jun 20 '20 at 10:19
  • @DavidWierichs Yes, I understand what you are saying. So, would any of these methods be faster than np.where()? If so, please elaborate on how to do it. – noobprogrammer Jun 21 '20 at 07:07

1 Answer


Using a dictionary, you could store the positional information keyed by particle ID, taking advantage of the O(1) average lookup time of dictionaries:

# What the data in a single file would look like:
data = {1: [0.5, 0.1, 1.0], 4: [0.4, -0.2, 0.1], ...}
# A lookup becomes very simple syntactically:
for ID in chosen_id_arr:
    x, y, z = data[ID]
    # Here you can process the obtained x, y, z.
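
For completeness, a minimal sketch of how such a dictionary could be built once per file from the four aligned arrays; the names `particleID_in_data`, `x_component`, `y_component` and `z_component` are taken from the question, and `zip` simply walks the arrays in lockstep:

# Build the lookup table in a single pass over the file's arrays.
data = {
    pid: (x, y, z)
    for pid, x, y, z in zip(
        particleID_in_data, x_component, y_component, z_component
    )
}

Building the table costs one pass over the file, after which each of the 10,000 lookups is O(1) on average, instead of one full array scan per chosen ID as with np.where.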

This dictionary lookup is much faster than the numpy search. Regarding the processing of the location data within the loop, you could consider keeping separate lists of positions for distinct particle IDs, but that is not within the scope of the question, I think. The pandas package could also be of help there; a sketch follows below.
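
As an illustration of the pandas route (a sketch, again reusing the array names from the question): index a DataFrame by particle ID and fetch all chosen IDs in one vectorized call:

import pandas as pd

# One DataFrame per file, indexed by particle ID.
df = pd.DataFrame(
    {"x": x_component, "y": y_component, "z": z_component},
    index=particleID_in_data,
)

# reindex returns rows in the order of chosen_id_arr and
# fills NaN for IDs that are absent from this file.
chosen = df.reindex(chosen_id_arr)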

David Wierichs