
I'm trying to modify one of my existing scripts that uses uproot to read data from a ROOT file into a pandas DataFrame using uproot.pandas.iterate. Currently it only reads branches containing simple data types (floats, ints, bools), but I would like to add the ability to read some branches that store 3x3 matrices. I understand from the README that, in cases like this, it's recommended to flatten the structure by passing flatten=True as an argument to the iterate function. However, when I do this, it crashes:

Traceback (most recent call last):
  File "genPreselTuples.py", line 338, in <module>
    data = read_events(args.decaymode, args.tag, args.year, args.polarity, chunk=args.chunk, numchunks=args.numchunks, verbose=args.verbose, testing=args.testing)
  File "genPreselTuples.py", line 180, in read_events
    for df in uproot.pandas.iterate(filename_list, treename, branches=list(branchdict.keys()), entrysteps=100000, namedecode='utf-8', flatten=True):
  File "/afs/cern.ch/work/d/djwhite/miniconda3/envs/D02HHHHml/lib/python3.8/site-packages/uproot/tree.py", line 117, in iterate
    for start, stop, arrays in tree.iterate(branches=branchesinterp, entrysteps=entrysteps, outputtype=outputtype, namedecode=namedecode, reportentries=True, entrystart=0, entrystop=tree.numentries, flatten=flatten, flatname=flatname, awkwardlib=awkward, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=blocking):
  File "/afs/cern.ch/work/d/djwhite/miniconda3/envs/D02HHHHml/lib/python3.8/site-packages/uproot/tree.py", line 721, in iterate
    out = out()
  File "/afs/cern.ch/work/d/djwhite/miniconda3/envs/D02HHHHml/lib/python3.8/site-packages/uproot/tree.py", line 678, in <lambda>
    return lambda: uproot._connect._pandas.futures2df([(branch.name, interpretation, wrap_again(branch, interpretation, future)) for branch, interpretation, future, past, cachekey in futures], outputtype, start, stop, flatten, flatname, awkward)
  File "/afs/cern.ch/work/d/djwhite/miniconda3/envs/D02HHHHml/lib/python3.8/site-packages/uproot/_connect/_pandas.py", line 162, in futures2df
    array = array.view(awkward.numpy.dtype([(str(i), array.dtype) for i in range(functools.reduce(operator.mul, array.shape[1:]))])).reshape(array.shape[0])
ValueError: When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array.
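
Looking at that last frame, the ValueError appears to come from numpy rather than uproot itself: futures2df views each fixed-size array with a compound dtype spanning all of its trailing dimensions, and numpy (at least the versions contemporary with uproot 3) only permits such a view when the compound dtype's size divides the byte width of the array's last axis. A minimal sketch of the mechanism, using placeholder shapes rather than my actual data:

import numpy as np

# 5 events of a 3x3 float64 matrix, like one of the covariance branches
arr = np.zeros((5, 3, 3), dtype=np.float64)
compound = np.dtype([(str(i), arr.dtype) for i in range(9)])  # 9 fields = 72 bytes

ok = arr.reshape(5, 9).view(compound)  # works: last axis is 9*8 = 72 bytes wide
try:
    arr.view(compound)                 # last axis is only 3*8 = 24 bytes wide
except ValueError as err:
    print(err)  # same message as in the traceback above

So it looks as though, on the failing path, the matrix array reaches that view still in its (N, 3, 3) shape instead of being reshaped to (N, 9) first.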

My code is the following:

import uproot        # uproot 3.x (provides uproot.pandas.iterate)
import pandas as pd

# prepare for file reading
data = pd.DataFrame() # create empty dataframe to hold final output data
file_counter = 0      # count how many files have been processed
event_counter = 0     # count how many events were in input files that have been processed

# loop over files in filename_list & add contents to dataframe
for df in uproot.pandas.iterate(filename_list, treename, branches=list(branchdict.keys()), entrysteps=100000, namedecode='utf-8', flatten=True):
    df.rename(branchdict, axis='columns', inplace=True)   # rename branches to custom names (defined in dictionary)
    
    file_counter += 1                # manage file counting
    event_counter += df.shape[0]     # manage event counting
    
    print(df.head(10)) # debugging
    
    # apply all cuts
    for cut in cutlist:
        df.query(cut, inplace=True)
    
    # append events to dataframe of data
    data = data.append(df, ignore_index=True)
    
    # terminal output
    print(f'Processed {file_counter:,} chunks (kept {data.shape[0]:,} of {event_counter:,} events ({100*data.shape[0]/event_counter:.2f}%))', end='\r')

I have been able to get it to work with flatten=False (when printing the DataFrame, it explodes the values out into columns, similar to what is shown here: https://github.com/scikit-hep/uproot#multiple-values-per-event-fixed-size-arrays).

   eventNumber  runNumber  totCandidates  nCandidate  ...  D0_SubVtx_234_COV_[1][2]  D0_SubVtx_234_COV_[2][0]  D0_SubVtx_234_COV_[2][1]  D0_SubVtx_234_COV_[2][2]
0     13769776     177132              3           0  ...                 -0.016343                  0.032616                 -0.016343                  0.470791
1     13769776     177132              3           1  ...                 -0.016343                  0.032616                 -0.016343                  0.470791
2     13769776     177132              3           2  ...                 -0.016343                  0.032616                 -0.016343                  0.470791
3     36250092     177132              2           0  ...                  0.004726                 -0.017212                  0.004726                  0.193447
4     36250092     177132              2           1  ...                  0.004726                 -0.017212                  0.004726                  0.193447

[5 rows x 296 columns]

But I understand from the README that leaving these structures unflattened isn't recommended, at least for speed purposes - and since I have O(10^8) rows to get through, speed is something of a concern. I'm interested in what's causing this, so I can find out the best way to handle these objects (& eventually write them out to a new file later). Thanks!

EDIT: I've narrowed the problem down to the branches option. If I manually specify some branches (eg. branches=['eventNumber', 'D0_SubVtx_234_COV_']) then it works fine with both flatten=True and flatten=False. But when using the full list(branchdict.keys()), it gives the ValueError shown at the top of the original question.

I've checked this list, & all the elements in it are real branch names (otherwise it gives a KeyError instead) - it contains 206 regular branches, some holding standard data types and others holding length-1 lists of single data types, plus 10 branches containing similar 3x3 matrices.

If I remove the branches containing the matrices from this list, then it works as expected. The same is true if I remove only the length-1 lists. The crash occurs whenever I try to read (separate) branches containing both these length-1 lists and these 3x3 matrices.
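
To make that concrete, the pattern is roughly the following (file, tree, and branch names here are placeholders, not the real ones):

import uproot  # uproot 3.x

# 'COV' stands in for a fixed 3x3 matrix branch; 'P' for a branch that is
# declared variable-length but always holds exactly one value
for df in uproot.pandas.iterate('tree.root', 'DecayTree', branches=['COV'], flatten=True):
    pass  # matrix branches alone: works

for df in uproot.pandas.iterate('tree.root', 'DecayTree', branches=['P'], flatten=True):
    pass  # length-1 list branches alone: works

for df in uproot.pandas.iterate('tree.root', 'DecayTree', branches=['COV', 'P'], flatten=True):
    pass  # both together: raises the ValueError above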

  • I'll have to revisit the README to understand why it suggests that flattening is good or bad; it removes information from the output that you might or might not want. From what I see of this output, though, your data _are_ flat (no jagged arrays), so the question is moot. On the contrary, it looks like a bug in that `flatten=True` is trying to do something on already-flat data when it should be a pass-through. (Likely an untested case.) Do you want to file a bug-report? – Jim Pivarski Jul 16 '20 at 17:05
  • From your reply, I see that I've slightly misunderstood what's meant by 'flat' in this context - I had assumed that it referred to single-valued branches like individual floats, but now I see that it counts as flat as long as each entry has the same size/shape (eg. a matrix that's always 3x3, as opposed to an array of variable length). Nonetheless if you think this is a bug I'm happy to file a report - I will try to collect some more detailed output, do you suggest opening an issue on the uproot github page? – DylanJaide Jul 16 '20 at 17:31
  • Now I have thought about it more, I think it's worth adding this: because I previously misunderstood what `flatten=True` did, I've used that option for the whole time I've been working with branches containing only individual floats, ints, etc., and I've not seen that error before. So if it is a bug where already flat data should just be passed through, it must be something specific to arrays or other more complex data types. – DylanJaide Jul 16 '20 at 17:48
  • Hi @JimPivarski, while trying to provide some clearer examples of this for a bug report, I found that when specifying the branches manually it seemed to work fine with `flatten=True` specified - I'm adding details in an edit to the original question now – DylanJaide Jul 17 '20 at 11:43
  • On second thought, I think we can ignore this bug. I'm actively developing Uproot 4, it's in a usable state now, and it doesn't have a `flatten` parameter. I think this came out of earlier hand-wringing about how to turn HEP-style data into DataFrames, and we've mostly settled on the idea of putting the jaggedness into Pandas's MultiIndex. Fixed-width things often represent components, which you'd want as columns to make it easier to perform calculations in which you assume that they're `x`, `y`, and `z`, or `cov00`, `cov01`, `cov11`, etc. But Pandas can also turn columns into nested indexes. – Jim Pivarski Jul 17 '20 at 18:54
  • People use the word "flat" to mean many different things, which is another reason it's not a good function argument name. For many physicists, "flat" means no classes in ROOT, though the data could have all sorts of structure in `std::vectors` and `std::maps`. For me, "flat" means not jagged, since a fixed number of subentries can just be renamed as new columns (as our standard Pandas conversion does). To avoid ambiguity, I've started using "rectangular" or "rectilinear," rather than "flat." If your data are rectilinear, it will easily fit into DataFrames, without even invoking MultiIndex. – Jim Pivarski Jul 17 '20 at 18:58
  • Thanks @JimPivarski, this makes sense. I have narrowed my original problem down to when it tries to read one of the 3x3 matrix branches at the same time as another branch that's not a simple datatype but a list of length 1 (ie. '[180.0]' as opposed to '180.0', which has type object rather than, say, float32). But reading multiple of the matrix branches alone is fine, as is multiple of the length-1 list branches, as long as they don't mix. All of these branches are flat/rectilinear as far as I know - do you have any thoughts on why? Also, do you have a rough timescale for Uproot 4's release? – DylanJaide Jul 20 '20 at 09:46
  • A branch that is declared as variable length but in practice has length 1 is variable length. There's no way for a predefined routine to know that it won't have length 0 or 2 somewhere later in the dataset. As for Uproot4, it is released now: consider it beta. I'm doing the transitions gradually to have time to address user issues while both versions are available, but I am actively recommending new work to use the new package. – Jim Pivarski Jul 20 '20 at 14:56
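
For reference, a minimal sketch of the equivalent read in Uproot 4, as Jim suggests (file and tree names are placeholders, and this assumes the Uproot 4 API as released): there is no flatten argument; with library='pd', jagged branches become extra rows in a Pandas MultiIndex and fixed-size matrices become separate columns.

import uproot  # uproot 4.x ('import uproot4' during the beta)

for df in uproot.iterate('tree.root:DecayTree',
                         expressions=['eventNumber', 'D0_SubVtx_234_COV_'],
                         step_size=100000, library='pd'):
    pass  # step_size plays the role of entrysteps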
