I have a list, data, that I'd like to write out, one file per item, like so:
for i, chunk in enumerate(data):
    fname = ROOT / f'{i}.in'
    with open(fname, "wb") as fout:
        dill.dump(chunk, fout)
Since the data list can be quite long and I'm writing to a network storage location, the iteration spends a lot of time waiting on NFS round trips, and I'd like to do this asynchronously if possible.
I have something that basically looks like this now:
import dill
import asyncio
import aiofiles
from datetime import datetime
from pathlib import Path

ROOT = Path("/tmp/")
data = [str(i) for i in range(500)]
def serialize(data):
    """
    Write my data out in serial
    """
    for i, chunk in enumerate(data):
        fname = ROOT / f'{i}.in'
        print(fname)
        with open(fname, "wb") as fout:
            dill.dump(chunk, fout)
def aserialize(data):
    """
    Same as above, but writes my data out asynchronously
    """
    fnames = [ROOT / f'{i}.in' for i in range(len(data))]
    chunks = data

    async def write_file(i):
        fname = fnames[i]
        chunk = chunks[i]
        print(fname)
        async with aiofiles.open(fname, "wb") as fout:
            print(f"written: {i}")
            dill.dump(chunk, fout)
            await fout.flush()

    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*[write_file(i) for i in range(len(data))]))
Now, when I test the writes, this looks fast enough to be worthwhile on my NFS:
# test 1
start = datetime.utcnow()
serialize(data)
end = datetime.utcnow()
print(end - start)
# >>> 0:02:04.204681
# test 2
start = datetime.utcnow()
aserialize(data)
end = datetime.utcnow()
print(end - start)
# >>> 0:00:27.048893
# faster is better.
But when I actually de-serialize the data I wrote, I see that maybe it was fast because it wasn't writing anything:
def deserialize(dat):
    tmp = []
    for i in range(len(dat)):
        fname = ROOT / f'{i}.in'
        with open(fname, "rb") as fin:
            fo = dill.load(fin)
            tmp.append(fo)
    return tmp
serialize(data)
d2 = deserialize(data)
d2 == data
# True
Good, whereas:
aserialize(data)
d3 = deserialize(data)
>>> Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in deserialize
  File "...python3.7/site-packages/dill/_dill.py", line 305, in load
    obj = pik.load()
EOFError: Ran out of input
That is, the asynchronously written files are empty. No wonder it was so fast.
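A quick sanity check backs that up; stat-ing the output files (the only extra import here is os) shows zero bytes for everything aserialize produced:

import os

# every file produced by aserialize is zero bytes on disk
print(all(os.path.getsize(ROOT / f'{i}.in') == 0 for i in range(len(data))))
# >>> True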
How can I dill/pickle my list into files asynchronously and get them to actually write? I assume I need to await the dill.dump somehow? I thought fout.flush() would handle that, but it seems not.
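For what it's worth, here is a sketch of the direction I'm guessing at, untested and possibly not the right idiom: serialize each chunk to bytes with dill.dumps first, then await the aiofiles write, since dill.dump itself can't await anything:

async def write_file(i):
    fname = fnames[i]
    chunk = chunks[i]
    # serialize in memory first, since dill.dump can't await the file's write
    payload = dill.dumps(chunk)
    async with aiofiles.open(fname, "wb") as fout:
        # aiofiles' write is a coroutine, so the actual write must be awaited
        await fout.write(payload)

Is something like that the right way to do it, or is there a way to make dill.dump itself cooperate with aiofiles?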