The extend method is supposed to write a new TBasket with each invocation. (See the documentation, especially the orange warning box. The purpose of that is so that you can control the TBasket sizes.) If you're calling it 10000 times to write 1 value (the value [1, 2, 3]) each, that's a maximally inefficient use.
Fundamentally, you're thinking about this problem in an entry-by-entry way, rather than in terms of columns, the way that scientific processing is normally done in Python. What you want to do instead is to collect a large dataset in memory and write it to the file in one chunk. If the data that you'll eventually be addressing is larger than the memory on your computer, you would do it in "large enough" chunks, which is probably on the order of hundreds of megabytes or gigabytes.
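As a rough illustration of the "large enough chunks" idea, here is a minimal sketch that buffers entries and hands them to a writer one chunk at a time. The helper name write_in_chunks, the write_chunk callback, and the chunk size of 4000 rows are all hypothetical choices for this example; a real chunk would be far larger, and the callback here only records shapes instead of writing a file.

```python
import numpy as np


def write_in_chunks(entries, write_chunk, chunk_size):
    """Buffer rows and call write_chunk once per chunk (hypothetical helper).

    entries: any iterable yielding one fixed-length row at a time.
    write_chunk: callback that receives a 2D NumPy array of buffered rows.
    chunk_size: number of rows to accumulate before each write.
    """
    buffer = []
    num_calls = 0
    for row in entries:
        buffer.append(row)
        if len(buffer) == chunk_size:
            write_chunk(np.asarray(buffer, dtype=np.int64))
            num_calls += 1
            buffer = []
    if buffer:  # flush the final, partially filled chunk
        write_chunk(np.asarray(buffer, dtype=np.int64))
        num_calls += 1
    return num_calls


# Stand-in for a real writer: just record the shape of each chunk.
written_shapes = []
entry_stream = ([1, 2, 3] for _ in range(10000))
num_calls = write_in_chunks(
    entry_stream,
    lambda chunk: written_shapes.append(chunk.shape),
    chunk_size=4000,
)
print(num_calls, written_shapes)
```

With Uproot, the callback would presumably wrap something like fout["tree"].extend({"branch": ak.from_numpy(chunk)}), so that extend runs once per chunk rather than once per entry.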
For instance, starting with your example,
import time
import uproot
import numpy as np
import awkward as ak
np_array = np.array([[1, 2, 3]])
ak_array = ak.from_numpy(np_array)
starttime = time.time()
with uproot.recreate("bad.root") as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    for _ in range(10000):
        fout["tree"].extend({"branch": ak_array})
print("Total time:", time.time() - starttime)
The total time (on my computer) is 1.9 seconds and the TTree characteristics are atrocious:
******************************************************************************
*Tree :tree : *
*Entries : 10000 : Total = 1170660 bytes File Size = 2970640 *
* : : Tree compression factor = 1.00 *
******************************************************************************
*Br 0 :branch : branch[3]/L *
*Entries : 10000 : Total Size= 1170323 bytes File Size = 970000 *
*Baskets : 10000 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
Instead, we want the data to be in a single array (or some loop that produces ~GB scale arrays):
np_array = np.array([[1, 2, 3]] * 10000)
(This isn't necessarily how you would get np_array, since * 10000 makes a large, intermediate Python list. Suffice to say, you get the data somehow.)
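If you do want to build this toy array without the intermediate Python list, one option (my suggestion, not something the original example relies on) is np.tile, which repeats the row in compiled code. A small sketch, checked against the list-based version:

```python
import numpy as np

entry = np.array([1, 2, 3], dtype=np.int64)

# np.tile repeats the row 10000 times without building a Python list first.
tiled = np.tile(entry, (10000, 1))

# The list-multiplication version, for comparison only.
from_list = np.array([[1, 2, 3]] * 10000)

print(tiled.shape, np.array_equal(tiled, from_list))
```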
Now we do the write with a single call to extend, which makes a single TBasket:
np_array = np.array([[1, 2, 3]] * 10000)
ak_array = ak.from_numpy(np_array)
starttime = time.time()
with uproot.recreate("good.root") as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    fout["tree"].extend({"branch": ak_array})
print("Total time:", time.time() - starttime)
The total time (on my computer) is 0.0020 seconds and the TTree characteristics are much better:
******************************************************************************
*Tree :tree : *
*Entries : 10000 : Total = 240913 bytes File Size = 3069 *
* : : Tree compression factor = 107.70 *
******************************************************************************
*Br 0 :branch : branch[3]/L *
*Entries : 10000 : Total Size= 240576 bytes File Size = 2229 *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 107.70 *
*............................................................................*
So, the writing is almost 1000× faster and the compression is 100× better. (With one entry per TBasket in the previous example, there was no compression because any compressed data would be bigger than the original!)
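The effect of block size on compression can be seen outside of ROOT, too. The sketch below uses Python's zlib module as a stand-in compressor; it is not ROOT's basket format, which has its own headers and settings, so the exact byte counts will differ from the TTree printouts above. It compares compressing each 24-byte entry on its own with compressing all 10000 entries as one block.

```python
import array
import zlib

# One entry: three 64-bit integers (24 bytes), like branch[3]/L.
one_entry = array.array("q", [1, 2, 3]).tobytes()
num_entries = 10000

# Case 1: compress every entry on its own (one entry per "basket").
per_entry_total = num_entries * len(zlib.compress(one_entry))

# Case 2: compress all entries together (one big "basket").
all_entries = one_entry * num_entries
single_block_total = len(zlib.compress(all_entries))

raw_total = len(all_entries)
print("raw bytes:            ", raw_total)
print("compressed per entry: ", per_entry_total)
print("compressed as a block:", single_block_total)
```

Each tiny block pays the compressor's fixed overhead and has almost no repetition to exploit, whereas the single large block is highly repetitive and shrinks dramatically.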
By comparison, if we do entry-by-entry writing with PyROOT,
import time
import array
import ROOT
data = [1, 2, 3]
holder = array.array("q", [0]*3)
file = ROOT.TFile("pyroot.root", "RECREATE")
tree = ROOT.TTree("tree", "An Example Tree")
tree.Branch("branch", holder, "branch[3]/L")
starttime = time.time()
for _ in range(10000):
    for i, x in enumerate(data):
        holder[i] = x
    tree.Fill()
tree.Write("", ROOT.TObject.kOverwrite)
file.Close()
print("Total time:", time.time() - starttime)
The total time (on my computer) is 0.062 seconds and the TTree characteristics are fine:
******************************************************************************
*Tree :tree : An Example Tree *
*Entries : 10000 : Total = 241446 bytes File Size = 3521 *
* : : Tree compression factor = 78.01 *
******************************************************************************
*Br 0 :branch : branch[3]/L *
*Entries : 10000 : Total Size= 241087 bytes File Size = 3084 *
*Baskets : 8 : Basket Size= 32000 bytes Compression= 78.01 *
*............................................................................*
So, PyROOT is 30× slower here than the single-call Uproot write (0.062 versus 0.0020 seconds), but the compression is almost as good. ROOT decided to make 8 TBaskets, which is configurable with AutoFlush parameters.
Keep in mind, though, that this is a comparison of techniques, not libraries. If you wrap a NumPy array with RDataFrame and write that, then you can skip all of the overhead involved in the Python for loop and you get the advantages of columnar processing.
But columnar processing only matters if you're working with big data. Much like compression, if you apply it to very small datasets (or a very small dataset many times), then it can hurt, rather than help.