I'm just getting started with the bcolz package and running through the tutorial on ctables. Creating a table using the fromiter function, i.e.:
import bcolz

N = 100 * 1000
ct = bcolz.fromiter(((i, i*i) for i in range(N)), dtype="i4,f8", count=N, rootdir='mydir', mode="w")
is fast, taking about 30 ms on my computer (2.7 GHz Core i7 with SSD storage). However, the second example:
with bcolz.zeros(0, dtype="i4,f8", rootdir='mydir', mode="w") as ct:
    for i in range(N):
        ct.append((i, i**2))
is very slow (45 seconds). I can get it closer to the fromiter time by not writing to disk (i.e. removing rootdir='mydir', mode="w"), but it's still around 2 seconds.
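For reference, the in-memory variant is just the same loop with rootdir and mode dropped, so nothing here beyond the snippet above:

import bcolz

N = 100 * 1000
ct = bcolz.zeros(0, dtype="i4,f8")  # in-memory ctable: no rootdir, no mode
for i in range(N):
    ct.append((i, i**2))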
This example performs a lot of very small appends, and I'm wondering whether that is a recommended usage pattern when one has lots of data. The documentation doesn't give hard numbers for how long these operations should take, just lots of suggestions that the library is fast.
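For what it's worth, the timings above come from a simple wall-clock wrapper like this (timed is just a throwaway helper of my own, not anything from bcolz):

import time

def timed(label, fn):
    # crude wall-clock timing; coarse but fine at these scales
    t0 = time.perf_counter()
    fn()
    dt = time.perf_counter() - t0
    print("%s: %.1f ms" % (label, dt * 1000.0))

# e.g. timed("fromiter", lambda: bcolz.fromiter(
#     ((i, i*i) for i in range(N)), dtype="i4,f8", count=N))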
I tried modifying the code to write the data in blocks:
import numpy as np

with bcolz.zeros(0, dtype="i4,f8", rootdir="mydir", mode='w') as ct:
    for i in range(10):
        ii = np.arange(10000) + 10000*i
        ct.append((ii, ii**2))
and this now takes 45 ms (down to 6 ms if I don't write to disk). This seems much more in line with the suggested uses for bcolz that I've seen.
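To make the blocking reusable I've been experimenting with a small buffering wrapper. BlockAppender is entirely a helper of my own, not part of bcolz: it accumulates rows in a Python list and flushes them to the ctable as one structured NumPy array every block_size rows.

import numpy as np

class BlockAppender:
    # buffers individual rows and appends them to a ctable in blocks
    def __init__(self, ct, block_size=10000):
        self.ct = ct
        self.block_size = block_size
        self.rows = []

    def append(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.block_size:
            self.flush()

    def flush(self):
        if self.rows:
            # one big append instead of many tiny ones
            self.ct.append(np.array(self.rows, dtype=self.ct.dtype))
            self.rows = []

Used like this, it gives timings similar to the hand-blocked loop above:

with bcolz.zeros(0, dtype="i4,f8", rootdir="mydir", mode="w") as ct:
    ba = BlockAppender(ct)
    for i in range(N):
        ba.append((i, i**2))
    ba.flush()  # don't forget the final partial block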
I can't find much documentation about needing to batch writes like this, so is this expected behaviour, or could it be something specific to my system?