
I'm setting up a machine learning project with scikit-learn. The input data are flat ROOT NTuples.

In the past I have been using root_numpy to convert the NTuples to a pandas.DataFrame saved in an h5 file.

I was wondering if I could use uproot to:
a) skip the h5 conversion altogether?
b) use less memory than loading the DataFrame from h5?

My naive first try looks something like this:

import uproot


def dropAndKeep(df, dropVariables=None, keepVariables=None, presel=None, inplace=True):
    '''
    Runs the preselection and keeps only the desired variables in the DataFrame.
    '''
    if (presel is not None) and (not callable(presel)):
        raise ValueError("Please either provide a function to 'presel' or leave it blank")

    if callable(presel):
        if not inplace:
            df = df.drop(df[~presel(df)].index, inplace=False)
        else:
            df.drop(df[~presel(df)].index, inplace=True)

    if keepVariables is not None:
        dropThese = list(set(df.columns) - set(keepVariables))
        return df.drop(columns=dropThese, inplace=inplace)

    if dropVariables is not None:
        return df.drop(columns=dropVariables, inplace=inplace)


def load_root(inFile, key, dropVariables=None, keepVariables=None, presel=None):
    '''
    Loads a TTree from a ROOT file into a DataFrame.
    '''
    df = uproot.open(inFile)[key].pandas.df()
    dropAndKeep(df, dropVariables, keepVariables, presel=presel, inplace=True)
    return df


inFile = "path/to/file.root"
key = "ntuple"
df = load_root(inFile, key)

This takes a really long time. Is there a better way of doing this?

GabrielG

4 Answers


Be aware that each call to uproot.open(...) and file[key] loads TFile and TTree metadata using pure Python, the slowest part of uproot. If you call this more than once, try keeping the TFile and/or TTree objects around and re-using them.
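For example, a minimal sketch of that re-use (this assumes the uproot 3 API of that era; the file path, tree name, and branch names are placeholders):

import uproot

tree = uproot.open("path/to/file.root")["ntuple"]   # TFile/TTree metadata is parsed once here

# re-use the same TTree object for every subsequent read
pt  = tree.array("electron_pt")
eta = tree.array("electron_eta")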

Also, it looks like your dropAndKeep function is only dropping rows (events), but if I'm reading it wrong and it's doing columns (branches), then use the branches argument of uproot's array-reading functions to only send the branches you want. Since the data in a ROOT file are arranged in columns, you can't avoid reading unwanted events—you have to cut them after the fact (in any framework).
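For example, a small sketch of the branches argument (uproot 3 API; the branch names are placeholders):

import uproot

tree = uproot.open("path/to/file.root")["ntuple"]
df = tree.pandas.df(branches=["electron_pt", "electron_eta"])   # read only these branches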

Next, note that Pandas is considerably slower than NumPy for simple operations like filtering events. If you want to speed that up, get the arrays with TTree.arrays, rather than TTree.pandas.df, construct one NumPy array of booleans for your selection, and apply it to each array in the dict that TTree.arrays returns. Then you can put all of those into a DataFrame with Pandas's DataFrame constructor (if you really need Pandas at all).
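A minimal sketch of that approach (again the uproot 3 API; the branch names and the cut are placeholders):

import pandas as pd
import uproot

tree = uproot.open("path/to/file.root")["ntuple"]
arrays = tree.arrays(["electron_pt", "electron_eta"], namedecode="utf-8")

# build one boolean mask and apply it to every array in the dict
mask = arrays["electron_pt"] >= 25
selected = {name: array[mask] for name, array in arrays.items()}

# only if you really need Pandas at all
df = pd.DataFrame(selected)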

It's true that you don't need to go through HDF5, and you don't need to go through Pandas, either. Your machine learning framework (TensorFlow? Torch?) almost certainly has an interface that accepts NumPy arrays with zero-copy (or one-copy to the GPU). Tutorials stressing HDF5 or Pandas do so because for the majority of users (non-HEP), these are the most convenient interfaces. Their data are likely already in HDF5 or Pandas; our data are likely in ROOT.
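As an illustration only (PyTorch is assumed here; the question does not say which framework is used), the hand-off from a NumPy array to a tensor is zero-copy:

import numpy as np
import torch

features = np.random.rand(1000, 4).astype(np.float32)   # stand-in for arrays read from ROOT
tensor = torch.from_numpy(features)                      # shares memory with features, no copy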

If your machine learning will be on the GPU, perhaps you want to do your event selection on the GPU as well. CuPy is a NumPy clone that allocates and operates entirely on the GPU, and your TensorFlow/Torch tensors might have a zero-copy interface to CuPy arrays. In principle, uproot should be able to write directly from ROOT files into CuPy arrays if a CuPy array is used as the destination of the asarray interpretation. I haven't tried it, though.

If you have control over the ROOT files to process, try to make their baskets large (increase the flush size) and their data structure simple (e.g. pure numbers or arrays/vectors of numbers, nothing deeper). Perhaps most importantly, use a lightweight compression like lz4, rather than a heavyweight one like lzma.

Uproot can read baskets in parallel, but this has only proven useful when it has a lot of non-Python computation to do, such as decompressing lzma.

If you're going to be reading these arrays over and over, you might want to write intermediate files with numpy.save, which is essentially just raw bytes on disk. That means there's no deserialization when reading it back, as opposed to the work necessary to decode a ROOT or HDF5 file. Because it's such a simple format, you can even read it back with numpy.memmap, which peeks into the OS's page cache as it lazily loads the data from disk, removing even the explicit copy of bytes.
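A short sketch of that intermediate step (the array contents and file name are placeholders):

import numpy as np

pt = np.random.rand(1_000_000).astype(np.float32)   # stand-in for an array read from ROOT
np.save("electron_pt.npy", pt)                      # essentially raw bytes on disk

# later: lazily page the data back in instead of copying it all up front
pt_back = np.load("electron_pt.npy", mmap_mode="r")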

Not all of these tricks will be equally helpful. I tried to put the most important ones first, but experiment before committing to a big code rewrite that might not make much difference. Some tricks can't be combined with others, such as CuPy and memmap (memmap always lazily loads into main memory, never GPU memory). But some combinations may be fruitful.

Good luck!

Jim Pivarski
  • I forgot to mention [bcolz](https://bcolz.readthedocs.io/en/latest/) and [zarr](https://zarr.readthedocs.io/en/stable/), which are both lightweight formats to quickly deliver compressed data as NumPy arrays—in case you like the lightweight intermediate idea but need compression on the intermediate data. Also, be sure to use a fast disk for that, like SSD. – Jim Pivarski Nov 12 '19 at 12:37

And everyone is forgetting an obvious ingredient: RDataFrame.AsNumpy(), see e.g. https://root.cern.ch/doc/master/df026__AsNumpyArrays_8py.html

With that, there's no need for temporary files, nor to load everything into memory, and reading happens at native C++ speed. Happy to see a report on what worked better at https://root-forum.cern.ch !
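A hedged sketch of what that looks like (PyROOT; the tree name, file name, cut, and column names are placeholders taken from the question's setup):

import ROOT

rdf = ROOT.RDataFrame("ntuple", "path/to/file.root")
arrays = rdf.Filter("electron_pt >= 25").AsNumpy(["electron_pt", "electron_eta"])
# arrays is a dict mapping column name -> NumPy array, filled at C++ speed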

Axel Naumann
  • But you still need to install ROOT, and there are many machine learning people who just want to read the NTuples and do not want to compile and install a huge framework only for I/O. `uproot` is perfectly fine for people who need NumPy data and work in that "world" and thus do not depend on actual ROOT functionality. – tamasgal Nov 18 '19 at 10:39

One more side point from me. I'm in the uproot -> HDF5 camp; that way I can do the slow part (reading the files into HDF5) once, as well as combine smaller files and do a little processing. I also keep the compression low or off. This can turn a 4-5 minute uproot read of many files into a <10-second HDF5 read of a few files.

The point I can add is that if you have "jagged" data, such as truth information, this works beautifully by directly using AwkwardArray, which has native support for hdf5. I use h5py to work with the HDF5 files. You can see what I do here: https://gitlab.cern.ch/LHCb-Reco-Dev/pv-finder.
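A hedged sketch of that workflow; it assumes the awkward-array 0.x API (awkward.hdf5 wrapping an h5py file) that was current alongside uproot 3, and the file, tree, and branch names are placeholders:

import awkward
import h5py
import uproot

tree = uproot.open("path/to/file.root")["ntuple"]
truth_pt = tree.array("truth_pt")            # a JaggedArray: one variable-length list per event

with h5py.File("file.h5", "w") as f:
    ah5 = awkward.hdf5(f)                    # awkward's HDF5 wrapper around the h5py file
    ah5["truth_pt"] = truth_pt               # stores the jagged structure natively

with h5py.File("file.h5", "r") as f:
    truth_pt_back = awkward.hdf5(f)["truth_pt"]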

It was also designed this way because I didn't have an environment I could run anywhere with ROOT and the ML tools in it at the same time, but now I use a single environment.yml file with both, using Conda-forge ROOT and the ML tools (PyTorch, etc.).

Henry Schreiner

It looks like Jim has provided a great overview of options, so I'll provide a strategy that is a bit specialized:

Since you appear to be performing a "preselection" step, I think you can potentially benefit from saving intermediate files in a NumPy, HDF5, or Parquet format; this way you avoid repeating the selection calculation every time you process your data on disk (and loading these formats into NumPy or pandas is as trivial as saving them). So, my suggestion would be to load the ROOT-flavored data once (and only read branches you are interested in), perform preselection step(s), and save intermediate files for later use. I'll give a more concrete example:

I've had a workflow that included three selections. We can represent them as pandas.DataFrame.eval/pandas.DataFrame.query strings. (pandas.eval is accelerated with numexpr when available). These are similar to TTree::Draw selections. Here's an arbitrary example where my tree has the columns [electron_pt, regionA, regionB, regionC].

selectA = "(electron_pt >= 25) & (regionA == True)"
selectB = "(electron_pt >= 30) & (regionB == True)"
selectC = "(electron_pt >= 35) & (regionC == True)"

I can load my data into a dataframe once and apply the selections:

import uproot

keep_columns = [......]  # some list of branches to keep; it must contain the selection branches
df = uproot.open("file.root").get("tree").pandas.df(branches=keep_columns)

selections = {
    "A": selectA,
    "B": selectB,
    "C": selectC
}

Now we can loop over the selections, query the dataframe, and save the intermediate format containing only a specific selection:

for name, selection in selections.items():
    df.query(selection).to_hdf("file_selection{}.h5".format(name), name)
    # or save to parquet (if pyarrow is installed):
    # df.query(selection).to_parquet("file_selection{}.parquet".format(name))

Then later read the files back into memory with pandas.read_hdf or pandas.read_parquet.
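For example, matching the hypothetical file names above:

import pandas as pd

df_A = pd.read_hdf("file_selectionA.h5", "A")
# or, for the Parquet variant:
# df_A = pd.read_parquet("file_selectionA.parquet")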

This is a strategy that has worked very well for me in the past when I've trained ML classifiers on data that originate from a common source, but need to be categorized into a few different selections.

ddavis