Writing Trees, number of baskets and compression (uproot)

Question

I am trying to optimize the way trees are written in PyROOT and came across Uproot. In the end, my application should write continuously incoming events (each consisting of arrays) to a tree.

The first approach is the classic way:

import array
import ROOT

event = [1., 2., 3.]
f = ROOT.TFile("my_tree.root", "RECREATE")
tree = ROOT.TTree("tree", "An Example Tree")

# Buffer that the branch reads from on every Fill()
pt = array.array('f', [0.]*3)
tree.Branch("pt", pt, "pt[3]/F")

# for loop to simulate incoming events
for _ in range(10000):
    for i, element in enumerate(event):
        pt[i] = element

    tree.Fill()

tree.Print()
tree.Write("", ROOT.TObject.kOverwrite)
f.Close()

This gives the following Tree and execution time:

[Image: tree characteristics]

With Uproot, my code looks like this:

import numpy as np
import awkward as ak
import uproot

np_array = np.array([[1, 2, 3]])
ak_array = ak.from_numpy(np_array)

with uproot.recreate("testing.root", compression=None) as fout:
    fout.mktree("tree", {"branch": ak_array.type})

    # for loop to simulate incoming events
    for _ in range(10000):
        fout["tree"].extend({"branch": ak_array})

which gives the following tree:

[Image: tree characteristics]

So the Uproot method takes much longer, the file is much bigger, and each event gets a separate basket. I tried different compression settings, but that did not change anything. Any idea how to optimize this? Is this even a sensible use case for Uproot, and can writing trees be sped up compared to the first approach?

Uproot is not an improvement over PyROOT at all. In fact, for most basic use cases it will perform worse than PyROOT. Only for larger arrays (tens of kilobytes) will Uproot match and potentially outperform PyROOT. — Manish Dash, May 10 '22 at 14:23

score 1 · Answer 1 · answered May 10 '22 at 15:10

The extend method is supposed to write a new TBasket with each invocation. (See the documentation, especially the orange warning box. The purpose of that is so that you can control the TBasket sizes.) If you're calling it 10000 times to write 1 value (the value [1, 2, 3]) each, that's a maximally inefficient use.

Fundamentally, you're thinking about this problem in an entry-by-entry way, rather than in terms of columns, the way that scientific processing is normally done in Python. What you want to do instead is to collect a large dataset in memory and write it to the file in one chunk. If the data that you'll eventually be addressing is larger than the memory on your computer, you would do it in "large enough" chunks, which is probably on the order of hundreds of megabytes or gigabytes.

For instance, starting with your example,

import time
import uproot
import numpy as np
import awkward as ak

np_array = np.array([[1, 2, 3]])
ak_array = ak.from_numpy(np_array)

starttime = time.time()

with uproot.recreate("bad.root") as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    for _ in range(10000):
        fout["tree"].extend({"branch": ak_array})

print("Total time:", time.time() - starttime)

The total time (on my computer) is 1.9 seconds and the TTree characteristics are atrocious:

******************************************************************************
*Tree    :tree      :                                                        *
*Entries :    10000 : Total =         1170660 bytes  File  Size =    2970640 *
*        :          : Tree compression factor =   1.00                       *
******************************************************************************
*Br    0 :branch    : branch[3]/L                                            *
*Entries :    10000 : Total  Size=    1170323 bytes  File Size  =     970000 *
*Baskets :    10000 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*

Instead, we want the data to be in a single array (or some loop that produces ~GB scale arrays):

np_array = np.array([[1, 2, 3]] * 10000)

(This isn't necessarily how you would get np_array, since * 10000 makes a large, intermediate Python list. Suffice to say, you get the data somehow.)
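
One way to build the same array without the intermediate Python list is to let NumPy repeat the row itself. This is only a sketch: `np.tile` is one of several options, and in a real application the rows would come from the incoming data rather than from a constant.

```python
import numpy as np

# A single event, as in the example above.
event = np.array([1, 2, 3])

# Repeat the event 10000 times along a new first axis, giving shape (10000, 3),
# without first building a 10000-element Python list.
np_array = np.tile(event, (10000, 1))

print(np_array.shape)
```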

Now we do the write with a single call to extend, which makes a single TBasket:

np_array = np.array([[1, 2, 3]] * 10000)
ak_array = ak.from_numpy(np_array)

starttime = time.time()

with uproot.recreate("good.root") as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    fout["tree"].extend({"branch": ak_array})

print("Total time:", time.time() - starttime)

The total time (on my computer) is 0.0020 seconds and the TTree characteristics are much better:

******************************************************************************
*Tree    :tree      :                                                        *
*Entries :    10000 : Total =          240913 bytes  File  Size =       3069 *
*        :          : Tree compression factor = 107.70                       *
******************************************************************************
*Br    0 :branch    : branch[3]/L                                            *
*Entries :    10000 : Total  Size=     240576 bytes  File Size  =       2229 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression= 107.70     *
*............................................................................*

So, the writing is almost 1000× faster and the compression is 100× better. (With one entry per TBasket in the previous example, there was no compression because any compressed data would be bigger than the original!)
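
The effect of chunk size on compression can be illustrated outside of ROOT. The sketch below uses the standard-library `zlib` module as a stand-in for ROOT's compression, which is an assumption: it does not model TBasket headers or keys, which add further fixed overhead per basket, so the numbers will not match the `tree.Print()` output above. It only shows that a compressor needs a long buffer before it can exploit repetition.

```python
import struct
import zlib

# One entry: three 64-bit integers, 24 bytes in total.
one_entry = struct.pack(">3q", 1, 2, 3)

# 10000 identical entries in one contiguous buffer, like a single large basket.
many_entries = one_entry * 10000


def compression_factor(raw: bytes) -> float:
    """Uncompressed size divided by zlib-compressed size."""
    return len(raw) / len(zlib.compress(raw))


factor_one = compression_factor(one_entry)
factor_many = compression_factor(many_entries)

print(f"1 entry:       {len(one_entry)} bytes, factor {factor_one:.2f}")
print(f"10000 entries: {len(many_entries)} bytes, factor {factor_many:.2f}")
```

The large buffer compresses far better per byte than the single entry, because the fixed cost of the compressed stream is spread over many entries and the repeated pattern becomes visible to the compressor.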

By comparison, if we do entry-by-entry writing with PyROOT,

import time
import array
import ROOT

data = [1, 2, 3]
holder = array.array("q", [0]*3)

file = ROOT.TFile("pyroot.root", "RECREATE")
tree = ROOT.TTree("tree", "An Example Tree")
tree.Branch("branch", holder, "branch[3]/L")

starttime = time.time()
for _ in range(10000):
    for i, x in enumerate(data):
        holder[i] = x

    tree.Fill()

tree.Write("", ROOT.TObject.kOverwrite)
file.Close()

print("Total time:", time.time() - starttime)

The total time (on my computer) is 0.062 seconds and the TTree characteristics are fine:

******************************************************************************
*Tree    :tree      : An Example Tree                                        *
*Entries :    10000 : Total =          241446 bytes  File  Size =       3521 *
*        :          : Tree compression factor =  78.01                       *
******************************************************************************
*Br    0 :branch    : branch[3]/L                                            *
*Entries :    10000 : Total  Size=     241087 bytes  File Size  =       3084 *
*Baskets :        8 : Basket Size=      32000 bytes  Compression=  78.01     *
*............................................................................*

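So, PyROOT is 30× slower here than the single-call Uproot write (0.062 s versus 0.0020 s), though still about 30× faster than calling extend once per entry, and the compression is almost as good. ROOT decided to make 8 TBaskets, which is configurable with AutoFlush parameters.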

Keep in mind, though, that this is a comparison of techniques, not libraries. If you wrap a NumPy array with RDataFrame and write that, then you can skip all of the overhead involved in the Python for loop and you get the advantages of columnar processing.

But columnar processing only matters if you're working with big data. Much like compression, if you apply it to very small datasets (or a very small dataset many times), then it can hurt, rather than help.

Thank you very much for the clarification! So would it be most sensible to first store the incoming data in a NumPy array, for example, using as much memory as possible, then convert it into an Awkward Array and write it to the tree with the extend method? Or should I rather use the RDataFrame tools? — Jailbone, May 10 '22 at 15:39
In the above case, each entry has the same number of items (3), so it could be a NumPy array *instead of* an Awkward Array. (You're only using `ak.Array.type`; Uproot would take a shaped dtype in its place.) If the number of items per entry is going to be variable (jagged), you'll need it to be an Awkward Array. But in either case, you'll accumulate as much as you can per call to `uproot.WritableTTree.extend`. — Jim Pivarski, May 11 '22 at 16:51
RDataFrame works, too, though then it would be easier to use RDataFrame's own `Snapshot` method to write a TTree. If you're in RDataFrame, then everything needs to be entry-by-entry instead of columnar, so which one you choose depends on what you prefer or what other constraints you have. We're also implementing Awkward Array ↔ RDataFrame conversions, so your choices will be even less constrained because you can always go between them. Then it's just a matter of using Awkward when you feel like doing columnar processing and RDataFrame (or Numba) when you don't. — Jim Pivarski, May 11 '22 at 16:55
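
The accumulate-then-extend pattern discussed in these comments could be sketched as follows. `ChunkBuffer` and its parameters are hypothetical names introduced for illustration; they are not part of Uproot. To keep the sketch runnable without a ROOT file, the writer is passed in as a plain callable (`sink`), and here it just collects the chunks in a list.

```python
import numpy as np


class ChunkBuffer:
    """Collect fixed-length events and hand them to `sink` in large chunks.

    Hypothetical helper for illustration; not part of Uproot.
    """

    def __init__(self, sink, items_per_event, events_per_chunk, dtype=np.float32):
        self.sink = sink  # callable taking one (n, items_per_event) array
        self.buffer = np.empty((events_per_chunk, items_per_event), dtype=dtype)
        self.count = 0  # number of valid rows currently in the buffer

    def add(self, event):
        # Store one event and flush automatically when the buffer is full.
        self.buffer[self.count] = event
        self.count += 1
        if self.count == len(self.buffer):
            self.flush()

    def flush(self):
        if self.count > 0:
            # Copy so the sink keeps its data when the buffer is reused.
            self.sink(self.buffer[:self.count].copy())
            self.count = 0


# Stand-in sink that only records the chunks; replace with a real writer.
chunks = []
buf = ChunkBuffer(chunks.append, items_per_event=3, events_per_chunk=4000)

for _ in range(10000):  # simulate incoming events
    buf.add([1.0, 2.0, 3.0])
buf.flush()  # hand over the final, partially filled chunk

print([len(c) for c in chunks])
```

With Uproot, the sink would presumably be a function that calls `fout["tree"].extend({"branch": chunk})` on a tree created with `mktree`, so that each flush becomes one TBasket. The chunk size of 4000 events is only for demonstration; the answer above suggests chunks on the order of hundreds of megabytes.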
