
I have a ROOT file that I open with 2000 entries and a variable number of subentries, where each column is a different variable. Let's say I am only interested in 5 of those. I want to put them into an array with np.shape(array) == (2000, 250, 5). The 250 is plenty to contain all subentries per entry.

The ROOT file is converted into a dictionary by uproot: DATA = {variablename: [array of entries [array of subentries]]}.

I create an array with np.zeros((2000, 250, 5)) and fill it with the data I want, but it takes about 500 ms, and I need a solution that scales, since I am aiming for 1 million entries later on. I found multiple solutions, but my lowest was about 300 ms:

lim_i=len(N_DATA["nTrack"])
i=0
INPUT_ARRAY=np.zeros((lim_i,500,5))
for l in range(len(INPUT_ARRAY)):
    while i < lim_i:
        EVENT=np.zeros((500,5))
        k=0
        lim_k=len(TRACK_DATA["Track_pt"][i])
        while k<lim_k:
            EVENT[k][0]=TRACK_DATA["Track_pt"][i][k]
            EVENT[k][1]=TRACK_DATA["Track_phi"][i][k]
            EVENT[k][2]=TRACK_DATA["Track_eta"][i][k]
            EVENT[k][3]=TRACK_DATA["Track_dxy"][i][k]
            EVENT[k][4]=TRACK_DATA["Track_charge"][i][k]
            k+=1
        INPUT_ARRAY[i]=EVENT
        i+=1
INPUT_ARRAY

2 Answers


Observation 1: we can assign directly to the appropriate sub-arrays of INPUT_ARRAY[i], instead of creating EVENT as a proxy for INPUT_ARRAY[i] and then copying that in. (I will also set your variable names in lowercase, to follow normal conventions.)

lim_i = len(n_data["nTrack"])
i = 0
input_array = np.zeros((lim_i,500,5))
for l in range(len(input_array)):  # note: this outer loop is redundant; the while below exhausts i on its first pass
    while i < lim_i:
        k = 0
        lim_k = len(track_data["Track_pt"][i])
        while k < lim_k:
            input_array[i][k][0] = track_data["Track_pt"][i][k]
            input_array[i][k][1] = track_data["Track_phi"][i][k]
            input_array[i][k][2] = track_data["Track_eta"][i][k]
            input_array[i][k][3] = track_data["Track_dxy"][i][k]
            input_array[i][k][4] = track_data["Track_charge"][i][k]
            k += 1
        i += 1

Observation 2: the assignments we make in the innermost loop all have the same basic structure. It would be nice if we could take the various entries of the TRACK_DATA dict (which are 2-dimensional data) and stack them together. NumPy has a convenient (and efficient) built-in for stacking 2-dimensional data along the third dimension: np.dstack. Having prepared that 3-dimensional array, we can copy from it mechanically:

track_array = np.dstack((
    track_data['Track_pt'],
    track_data['Track_phi'],
    track_data['Track_eta'],
    track_data['Track_dxy'],
    track_data['Track_charge']
))
lim_i = len(n_data["nTrack"])
i = 0
input_array = np.zeros((lim_i,500,5))
for l in range(len(input_array)):
    while i < lim_i:
        k = 0
        lim_k = len(track_data["Track_pt"][i])
        while k < lim_k:
            input_array[i][k][0] = track_array[i][k][0]
            input_array[i][k][1] = track_array[i][k][1]
            input_array[i][k][2] = track_array[i][k][2]
            input_array[i][k][3] = track_array[i][k][3]
            input_array[i][k][4] = track_array[i][k][4]
            k += 1
        i += 1

Observation 3: but now, the purpose of our innermost loop is simply to copy an entire chunk of track_array along the last dimension. We could just do that directly:

track_array = np.dstack((
    track_data['Track_pt'],
    track_data['Track_phi'],
    track_data['Track_eta'],
    track_data['Track_dxy'],
    track_data['Track_charge']
))
lim_i = len(n_data["nTrack"])
i = 0
input_array = np.zeros((lim_i,500,5))
for l in range(len(input_array)):
    while i < lim_i:
        k = 0
        lim_k = len(track_data["Track_pt"][i])
        while k < lim_k:
            input_array[i][k] = track_array[i][k]
            k += 1
        i += 1

Observation 4: but actually, the same reasoning applies to the other two dimensions of the array. Clearly, our intent is to copy the entire array produced by the dstack; and that is already a new array, so we can just use it directly.

input_array = np.dstack((
    track_data['Track_pt'],
    track_data['Track_phi'],
    track_data['Track_eta'],
    track_data['Track_dxy'],
    track_data['Track_charge']
))
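
To see concretely what this buys you, here is a minimal, self-contained sketch with made-up rectangular data (the real values coming out of uproot may need an explicit conversion to NumPy arrays first; see the comments below):

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for track_data: 3 events with 4 track slots each,
# already rectangular NumPy arrays.
track_data = {
    'Track_pt':     rng.random((3, 4)),
    'Track_phi':    rng.random((3, 4)),
    'Track_eta':    rng.random((3, 4)),
    'Track_dxy':    rng.random((3, 4)),
    'Track_charge': rng.choice([-1.0, 1.0], size=(3, 4)),
}

input_array = np.dstack([
    track_data[name]
    for name in ('Track_pt', 'Track_phi', 'Track_eta', 'Track_dxy', 'Track_charge')
])
print(input_array.shape)  # (3, 4, 5): events x track slots x variables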
Karl Knechtel
  • It may happen that you need to convert the values from the `track_data` dict into Numpy arrays explicitly. I have never heard of the `awkward-array` or `uproot` tools you are using, so I don't know what they'll do here. I'm just covering the key idea, which is: *use Numpy tools* to manipulate Numpy data. – Karl Knechtel Nov 13 '20 at 01:18
  • But as an aside, while you should avoid explicitly iterating over Numpy arrays yourself (there is practically guaranteed to be a built-in Numpy thing that just does what you want, and probably much faster than native Python can), it would be a good idea to take this opportunity to learn [how to loop like a native](https://nedbatchelder.com/text/iter.html). – Karl Knechtel Nov 13 '20 at 01:20
  • np.dstack seems to be the right tool here, but I couldn't follow your observations (lack of Python skill on my part, I guess): `for i in range(eventrange): track_array = np.dstack((track_data['Track_pt'][i], track_data['Track_phi'][i], track_data['Track_eta'][i], track_data['Track_dxy'][i], track_data['Track_charge'][i])); input_array[i] = track_array` That is where I stand right now. For explanation: this is about particle physics. The tracks are the measured particles, and I want their variables to be in one object, but there are – SergeantIdiot Nov 13 '20 at 01:57
  • several events in one file, so I want the objects/tracks to be in separate arrays. The number of tracks is variable. The data shall become an input for a neural network, therefore I want the dimension of each event to be the same, and if there are fewer than 250 tracks, their values would remain 0. The code I have now puts the tracks into objects, but not sorted by event. My only idea is to iterate over them, but it is 100% the wrong idea. – SergeantIdiot Nov 13 '20 at 02:01
  • I think you should ask a new question. – Karl Knechtel Nov 13 '20 at 02:31

Taking note of Karl Knechtel's second comment, "You should avoid explicitly iterating over Numpy arrays yourself (there is practically guaranteed to be a built-in Numpy thing that just does what you want, and probably much faster than native Python can)," there is a way to do this with array-at-a-time programming, but not in NumPy. The reason Uproot returns Awkward Arrays is that you need a way to deal with variable-length data efficiently.

I don't have your file, but I'll start with a similar one:

>>> import uproot4
>>> import skhep_testdata
>>> events = uproot4.open(skhep_testdata.data_path("uproot-HZZ.root"))["events"]

The branches that start with "Muon_" in this file have the same variable-length structure as in your tracks. (The C++ typename is a dynamically sized array, interpreted in Python "as jagged.")

>>> events.show(filter_name="Muon_*")
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
Muon_Px              | float[]                  | AsJagged(AsDtype('>f4'))
Muon_Py              | float[]                  | AsJagged(AsDtype('>f4'))
Muon_Pz              | float[]                  | AsJagged(AsDtype('>f4'))
Muon_E               | float[]                  | AsJagged(AsDtype('>f4'))
Muon_Charge          | int32_t[]                | AsJagged(AsDtype('>i4'))
Muon_Iso             | float[]                  | AsJagged(AsDtype('>f4'))

If you just ask for these arrays, you get them as an Awkward Array.

>>> muons = events.arrays(filter_name="Muon_*")
>>> muons
<Array [{Muon_Px: [-52.9, 37.7, ... 0]}] type='2421 * {"Muon_Px": var * float32,...'>

To put them to better use, let's import Awkward Array and start by asking for its type.

>>> import awkward1 as ak
>>> ak.type(muons)
2421 * {"Muon_Px": var * float32, "Muon_Py": var * float32, "Muon_Pz": var * float32, "Muon_E": var * float32, "Muon_Charge": var * int32, "Muon_Iso": var * float32}

What does this mean? It means you have 2421 records with fields named "Muon_Px", etc., that each contain variable-length lists of float32 or int32, depending on the field. We can look at one of them by converting it to Python lists and dicts.

>>> muons[0].tolist()
{'Muon_Px': [-52.89945602416992, 37.7377815246582],
 'Muon_Py': [-11.654671669006348, 0.6934735774993896],
 'Muon_Pz': [-8.16079330444336, -11.307581901550293],
 'Muon_E': [54.77949905395508, 39.401695251464844],
 'Muon_Charge': [1, -1],
 'Muon_Iso': [4.200153350830078, 2.1510612964630127]}

(You could have made these lists of records, rather than records of lists, by passing how="zip" to TTree.arrays or using ak.unzip and ak.zip in Awkward Array, but that's tangential to the padding that you want to do.)
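
(For reference, a minimal sketch of that how="zip" form; I am assuming it factors out the common "Muon_" prefix and zips the branches into per-muon records, so the exact field naming may differ:)

>>> zipped = events.arrays(filter_name="Muon_*", how="zip")
>>> zipped[0].tolist()  # roughly: {'Muon': [{'Px': -52.9, 'Py': -11.7, ...}, ...]}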

The problem is that the lists have different lengths. NumPy doesn't have any functions that will help us here because it deals entirely in rectilinear arrays. Therefore, we need a function that's specific to Awkward Array, ak.num.

>>> ak.num(muons)
<Array [{Muon_Px: 2, ... Muon_Iso: 1}] type='2421 * {"Muon_Px": int64, "Muon_Py"...'>

This is telling us the number of elements in each list, per field. For clarity, look at the first one:

>>> ak.num(muons)[0].tolist()
{'Muon_Px': 2, 'Muon_Py': 2, 'Muon_Pz': 2, 'Muon_E': 2, 'Muon_Charge': 2, 'Muon_Iso': 2}

You want to turn these irregular lists into regular lists that all have the same size. That's called "padding." Again, there's a function for that, but we first need to get the maximum number of elements, so that we know how much to pad it by.

>>> ak.max(ak.num(muons))
4

So let's make them all length 4.

>>> ak.pad_none(muons, ak.max(ak.num(muons)))
<Array [{Muon_Px: [-52.9, 37.7, ... None]}] type='2421 * {"Muon_Px": var * ?floa...'>

Again, let's look at the first one to understand what we have.

>>> ak.pad_none(muons, ak.max(ak.num(muons)))[0].tolist()
{'Muon_Px': [-52.89945602416992, 37.7377815246582, None, None],
 'Muon_Py': [-11.654671669006348, 0.6934735774993896, None, None],
 'Muon_Pz': [-8.16079330444336, -11.307581901550293, None, None],
 'Muon_E': [54.77949905395508, 39.401695251464844, None, None],
 'Muon_Charge': [1, -1, None, None],
 'Muon_Iso': [4.200153350830078, 2.1510612964630127, None, None]}

You wanted to pad them with zeros, not None, so we convert the missing values into zeros.

>>> ak.fill_none(ak.pad_none(muons, ak.max(ak.num(muons))), 0)[0].tolist()
{'Muon_Px': [-52.89945602416992, 37.7377815246582, 0.0, 0.0],
 'Muon_Py': [-11.654671669006348, 0.6934735774993896, 0.0, 0.0],
 'Muon_Pz': [-8.16079330444336, -11.307581901550293, 0.0, 0.0],
 'Muon_E': [54.77949905395508, 39.401695251464844, 0.0, 0.0],
 'Muon_Charge': [1, -1, 0, 0],
 'Muon_Iso': [4.200153350830078, 2.1510612964630127, 0.0, 0.0]}

Finally, NumPy doesn't have records (other than the structured array, which also implies that the columns are contiguous in memory; Awkward Array's "records" are abstract). So let's unzip what we have into six separate arrays.

>>> arrays = ak.unzip(ak.fill_none(ak.pad_none(muons, ak.max(ak.num(muons))), 0))
>>> arrays
(<Array [[-52.9, 37.7, 0, 0, ... 23.9, 0, 0, 0]] type='2421 * var * float64'>,
 <Array [[-11.7, 0.693, 0, 0, ... 0, 0, 0]] type='2421 * var * float64'>,
 <Array [[-8.16, -11.3, 0, 0, ... 0, 0, 0]] type='2421 * var * float64'>,
 <Array [[54.8, 39.4, 0, 0], ... 69.6, 0, 0, 0]] type='2421 * var * float64'>,
 <Array [[1, -1, 0, 0], ... [-1, 0, 0, 0]] type='2421 * var * int64'>,
 <Array [[4.2, 2.15, 0, 0], ... [0, 0, 0, 0]] type='2421 * var * float64'>)

Note that this one line does everything, starting from the initial data pulled from Uproot (muons). I'm not going to profile it now, but you'll find that this one line is considerably faster than explicit looping.
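
(If you want to check that claim yourself, the %timeit magic is a quick way to do it; this sketch assumes you are running in IPython or Jupyter:)

%timeit ak.unzip(ak.fill_none(ak.pad_none(muons, ak.max(ak.num(muons))), 0))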

Now what we have is semantically equivalent to six NumPy arrays, so we'll just convert them to NumPy. (Attempting that with the irregular data would fail; you have to pad the data explicitly first.)

>>> numpy_arrays = [ak.to_numpy(x) for x in arrays]
>>> numpy_arrays
[array([[-52.89945602,  37.73778152,   0.        ,   0.        ],
        [ -0.81645936,   0.        ,   0.        ,   0.        ],
        [ 48.98783112,   0.82756668,   0.        ,   0.        ],
        ...,
        [-29.75678635,   0.        ,   0.        ,   0.        ],
        [  1.14186978,   0.        ,   0.        ,   0.        ],
        [ 23.9132061 ,   0.        ,   0.        ,   0.        ]]),
 array([[-11.65467167,   0.69347358,   0.        ,   0.        ],
        [-24.40425873,   0.        ,   0.        ,   0.        ],
        [-21.72313881,  29.8005085 ,   0.        ,   0.        ],
        ...,
        [-15.30385876,   0.        ,   0.        ,   0.        ],
        [ 63.60956955,   0.        ,   0.        ,   0.        ],
        [-35.66507721,   0.        ,   0.        ,   0.        ]]),
 array([[ -8.1607933 , -11.3075819 ,   0.        ,   0.        ],
        [ 20.19996834,   0.        ,   0.        ,   0.        ],
        [ 11.16828537,  36.96519089,   0.        ,   0.        ],
        ...,
        [-52.66374969,   0.        ,   0.        ,   0.        ],
        [162.17631531,   0.        ,   0.        ,   0.        ],
        [ 54.71943665,   0.        ,   0.        ,   0.        ]]),
 array([[ 54.77949905,  39.40169525,   0.        ,   0.        ],
        [ 31.69044495,   0.        ,   0.        ,   0.        ],
        [ 54.73978806,  47.48885727,   0.        ,   0.        ],
        ...,
        [ 62.39516068,   0.        ,   0.        ,   0.        ],
        [174.20863342,   0.        ,   0.        ,   0.        ],
        [ 69.55621338,   0.        ,   0.        ,   0.        ]]),
 array([[ 1, -1,  0,  0],
        [ 1,  0,  0,  0],
        [ 1, -1,  0,  0],
        ...,
        [-1,  0,  0,  0],
        [-1,  0,  0,  0],
        [-1,  0,  0,  0]]),
 array([[4.20015335, 2.1510613 , 0.        , 0.        ],
        [2.18804741, 0.        , 0.        , 0.        ],
        [1.41282165, 3.38350415, 0.        , 0.        ],
        ...,
        [3.76294518, 0.        , 0.        , 0.        ],
        [0.55081069, 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        ]])]

And now NumPy's dstack is appropriate. (This makes the data contiguous in memory, so you could use NumPy's structured arrays if you want to. I would find that easier for keeping track of which index means which variable; a sketch of that follows the output below. But that's up to you. Actually, Xarray is particularly good at tracking metadata of rectilinear arrays.)

>>> import numpy as np
>>> np.dstack(numpy_arrays)
array([[[-52.89945602, -11.65467167,  -8.1607933 ,  54.77949905,
           1.        ,   4.20015335],
        [ 37.73778152,   0.69347358, -11.3075819 ,  39.40169525,
          -1.        ,   2.1510613 ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ]],

       [[ -0.81645936, -24.40425873,  20.19996834,  31.69044495,
           1.        ,   2.18804741],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ]],

       [[ 48.98783112, -21.72313881,  11.16828537,  54.73978806,
           1.        ,   1.41282165],
        [  0.82756668,  29.8005085 ,  36.96519089,  47.48885727,
          -1.        ,   3.38350415],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ]],

       ...,

       [[-29.75678635, -15.30385876, -52.66374969,  62.39516068,
          -1.        ,   3.76294518],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ]],

       [[  1.14186978,  63.60956955, 162.17631531, 174.20863342,
          -1.        ,   0.55081069],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ]],

       [[ 23.9132061 , -35.66507721,  54.71943665,  69.55621338,
          -1.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ]]])
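
As an aside on that structured-array suggestion: a minimal sketch, built from the numpy_arrays list above, with hypothetical field labels (any names would do):

>>> fields = ["Px", "Py", "Pz", "E", "Charge", "Iso"]  # hypothetical labels
>>> structured = np.zeros(numpy_arrays[0].shape, dtype=[(f, "f8") for f in fields])
>>> for f, arr in zip(fields, numpy_arrays):
...     structured[f] = arr  # Charge is cast from int to float here
...
>>> structured["Px"][0]  # the first event's padded Px values, accessed by name
array([-52.89945602,  37.73778152,   0.        ,   0.        ])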
Jim Pivarski
  • Thanks, this one is very helpful, as you go through an example similar to my problem. I would like to ask how I could scale this script, since you choose `ak.max(ak.num(muons))` (4 in this case). In the following days I have to pull multiple ROOT files, and I think it is rather inefficient to iterate through all the files just to get the real "max" across them. Is a high guess sufficient, or would it be too expensive in terms of memory? – SergeantIdiot Nov 13 '20 at 02:30
  • To answer this question, you have to ask what you're padding the arrays for. If it's because machine learning algorithms generally take fixed-width vectors as inputs, the choice will have less to do with how many particles are in the dataset and more to do with the properties of the machine learning algorithm. If you don't want to lose any particles, you probably shouldn't be padding the arrays at all, but doing the analysis directly on the irregular data. – Jim Pivarski Nov 14 '20 at 03:05
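
For instance, a minimal sketch of doing a calculation directly on the irregular data, assuming the muons array from the answer above is still in scope:

>>> import numpy as np
>>> pt = np.sqrt(muons["Muon_Px"]**2 + muons["Muon_Py"]**2)  # still jagged: one list per event
>>> ak.max(pt, axis=1)  # per-event maximum pt, computed without any padding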