Be aware that each call to uproot.open(...) and file[key] loads TFile and TTree metadata using pure Python, which is the slowest part of uproot. If you call these more than once, keep the TFile and/or TTree objects around and re-use them.
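For example (a sketch; the file name and tree name are placeholders):

```python
import uproot

file = uproot.open("data.root")   # hypothetical file name; open it once
tree = file["events"]             # hypothetical tree name; look it up once

# Re-use the same TTree object for every subsequent read.
first = tree.arrays(["pt"])
second = tree.arrays(["eta"])
```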
Also, it looks like your dropAndKeep function is only dropping rows (events), but if I'm reading it wrong and it's dropping columns (branches), then use the branches argument of uproot's array-reading functions to read only the branches you want, as in the sketch below. Since the data in a ROOT file are arranged in columns, you can't avoid reading unwanted events; you have to cut them after the fact (in any framework).
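For example, with hypothetical branch names:

```python
# Read only the branches you need; everything else is never touched on disk.
arrays = tree.arrays(["pt", "eta", "phi"])
```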
Next, note that Pandas is considerably slower than NumPy for simple operations like filtering events. If you want to speed that up, get the arrays with TTree.arrays rather than TTree.pandas.df, construct one NumPy array of booleans for your selection, and apply it to each array in the dict that TTree.arrays returns. Then you can put all of those into a DataFrame with Pandas's DataFrame constructor (if you really need Pandas at all).
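Something like this sketch, with hypothetical branch names and cut values:

```python
import numpy
import pandas

# Dict of branch name -> NumPy array (namedecode gives str keys in uproot 3).
arrays = tree.arrays(["pt", "eta"], namedecode="utf-8")

# One boolean NumPy array encoding the whole selection...
mask = (arrays["pt"] > 20.0) & (numpy.abs(arrays["eta"]) < 2.4)

# ...applied to each array in the dict.
selected = {name: array[mask] for name, array in arrays.items()}

# Only if you really need Pandas:
df = pandas.DataFrame(selected)
```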
It's true that you don't need to go through HDF5, and you don't need to go through Pandas, either. Your machine learning framework (TensorFlow? Torch?) almost certainly has an interface that accepts NumPy arrays with zero-copy (or one-copy to the GPU). Tutorials stressing HDF5 or Pandas do so because for the majority of users (non-HEP), these are the most convenient interfaces. Their data are likely already in HDF5 or Pandas; our data are likely in ROOT.
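With Torch, for instance (re-using the hypothetical selected dict from above):

```python
import torch

pt = torch.from_numpy(selected["pt"])  # zero-copy: shares memory with NumPy
pt_gpu = pt.to("cuda")                 # one copy, host to GPU
```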
If your machine learning will be on the GPU, perhaps you want to do your event selection on the GPU as well. CuPy is a NumPy clone that allocates and operates entirely on the GPU, and your TensorFlow/Torch tensors might have a zero-copy interface to CuPy arrays. In principle, uproot should be able to write directly from ROOT files into CuPy arrays if a CuPy array is used as the destination of the asarray interpretation. I haven't tried it, though.
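A sketch of the selection done on the GPU with CuPy (same hypothetical branches; I haven't benchmarked this):

```python
import cupy

pt_gpu = cupy.asarray(arrays["pt"])    # explicit copy, host to GPU
eta_gpu = cupy.asarray(arrays["eta"])

# The comparisons, '&', and masking all run as GPU kernels.
mask = (pt_gpu > 20.0) & (cupy.abs(eta_gpu) < 2.4)
pt_selected = pt_gpu[mask]
```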
If you have control over the ROOT files you process, try to make their baskets large (increase the flush size) and their data structures simple (e.g. pure numbers or an array/vector of numbers, nothing deeper). Perhaps most importantly, use a lightweight compression like lz4, rather than a heavyweight one like lzma.
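In PyROOT, that looks roughly like this (a sketch; check the enum names and pick a sensible flush size for your ROOT version):

```python
import ROOT

f = ROOT.TFile("out.root", "RECREATE")
f.SetCompressionAlgorithm(ROOT.ROOT.kLZ4)  # lightweight algorithm
f.SetCompressionLevel(4)

tree = ROOT.TTree("events", "events")
tree.SetAutoFlush(-10000000)  # negative value: flush baskets every ~10 MB
```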
Uproot can read baskets in parallel, but this has only proven useful when it has a lot of non-Python computation to do, such as decompressing lzma.
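In uproot 3, that's the executor argument of the array-reading functions:

```python
import concurrent.futures

executor = concurrent.futures.ThreadPoolExecutor(8)  # 8 worker threads
arrays = tree.arrays(["pt", "eta"], executor=executor)
```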
If you're going to be reading these arrays over and over, you might want to write intermediate files with numpy.save, which is essentially just raw bytes on disk. That means there's no deserialization when reading it back, as opposed to the work necessary to decode a ROOT or HDF5 file. Because it's such a simple format, you can even read it back with numpy.memmap, which peeks into the OS's page cache as it lazily loads the data from disk, removing even the explicit copy of bytes.
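For example (writing one of the hypothetical selected arrays from above):

```python
import numpy

numpy.save("pt.npy", selected["pt"])  # raw bytes plus a tiny header

# numpy.load with mmap_mode returns a numpy.memmap: pages are pulled
# through the OS page cache lazily, as you touch them.
pt = numpy.load("pt.npy", mmap_mode="r")
```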
Not all of these tricks will be equally helpful. I tried to put the most important ones first, but experiment before committing to a big code rewrite that might not make much difference. Some tricks can't be combined with others, such as CuPy and memmap (memmap always lazily loads into main memory, never GPU memory). But some combinations may be fruitful.
Good luck!