
I have a list, data, that I'd like to write out, one file per item, like so:

for i, chunk in enumerate(data):
    fname = ROOT / f'{i}.in'
    with open(fname, "wb") as fout:
        dill.dump(chunk, fout)

Since the data list can be quite long and I'm writing to a network storage location, I spend a lot of time waiting on NFS round trips during the iteration, and I'd like to do this asynchronously if possible.

I have something that basically looks like this now:

import dill
import asyncio
import aiofiles
from datetime import datetime
from pathlib import Path

ROOT = Path("/tmp/")

data = [str(i) for i in range(500)]

def serialize(data):
    """
    Write my data out in serial
    """
    for i, chunk in enumerate(data):
        fname = ROOT / f'{i}.in'
        print(fname)
        with open(fname, "wb") as fout:
            dill.dump(chunk, fout)

def aserialize(data):
    """
    Same as above, but writes my data out asynchronously
    """
    fnames = [ROOT / f'{i}.in' for i in range(len(data))]
    chunks = data

    async def write_file(i):
        fname = fnames[i]
        chunk = chunks[i]
        print(fname)
        async with aiofiles.open(fname, "wb") as fout:
            print(f"written: {i}")
            dill.dump(chunk, fout)
            await fout.flush()

    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*[write_file(i) for i in range(len(data))]))

Now, when I test the writes, this looks fast enough to be worthwhile on my NFS:

# test 1
start = datetime.utcnow()
serialize(data)
end = datetime.utcnow()
print(end - start)
# >>> 0:02:04.204681

# test 3
start = datetime.utcnow()
aserialize(data)
end = datetime.utcnow()
print(end - start)
# >>> 0:00:27.048893
# faster is better.

But when I actually *de*-serialize the data I wrote, I see that maybe it was fast because it wasn't writing anything:

def deserialize(dat):
    tmp = []
    for i in range(len(dat)):
        fname = ROOT / f'{i}.in'
        with open(fname, "rb") as fin:
            fo = dill.load(fin)
        tmp.append(fo)
    return tmp

serialize(data)
d2 = deserialize(data)
d2 == data
# True

Good, whereas:

aserialize(data)
d3 = deserialize(data)
>>> Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in deserialize
  File "...python3.7/site-packages/dill/_dill.py", line 305, in load
    obj = pik.load()
EOFError: Ran out of input

That is, the asynchronously written files are empty. No wonder it was so fast.

How can I dill/pickle my list into files asynchronously and get them to actually write? I assume I need to await the `dill.dump` somehow? I thought `fout.flush()` would handle that, but apparently not.

Mittenchops
  • In aiofiles, the `write` method is a coroutine, which means `f.write()` has to be awaited. The dill library doesn't know about that and thinks the `fout` you passed is a regular file. You should have gotten a "RuntimeWarning: coroutine was never awaited". @sanyash's answer should get it working, but I'm not sure it will be faster – balki Feb 06 '20 at 04:14
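
To see the comment's point in isolation, here is a hypothetical standalone snippet (the path /tmp/demo.in is made up): calling aiofiles' `write` without `await` only creates a coroutine object, so nothing reaches the file, and Python warns about the un-awaited coroutine when it is garbage-collected:

import asyncio
import aiofiles

async def demo():
    async with aiofiles.open("/tmp/demo.in", "wb") as fout:
        fout.write(b"never written")        # coroutine created but not awaited -> RuntimeWarning, no data written
        await fout.write(b"actually written")  # awaited -> bytes really go to the file

asyncio.run(demo())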

1 Answer


I changed the line `dill.dump(chunk, fout)` to `await fout.write(dill.dumps(chunk))` and got the data written to the files and deserialized correctly. It seems `dill.dump` only works with regular synchronous files, since it calls the file's `write` method without the `await` keyword.
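
Applied to the question's aserialize, a minimal sketch of that fix could look like this (using asyncio.run, available since Python 3.7, in place of the explicit event-loop calls):

import dill
import asyncio
import aiofiles
from pathlib import Path

ROOT = Path("/tmp/")

def aserialize(data):
    """
    Same as before, but serialize to bytes first and await the write.
    """
    async def write_file(i, chunk):
        fname = ROOT / f'{i}.in'
        # dill.dumps produces the pickled bytes synchronously;
        # the aiofiles write coroutine must then be awaited.
        async with aiofiles.open(fname, "wb") as fout:
            await fout.write(dill.dumps(chunk))

    async def main():
        await asyncio.gather(*(write_file(i, chunk) for i, chunk in enumerate(data)))

    asyncio.run(main())

Note that only the file I/O overlaps here; `dill.dumps` itself still runs synchronously on the event loop, but on NFS the round trips are where the time goes anyway.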

sanyassh
  • Are these equivalent? I'm looking at: https://dill.readthedocs.io/en/latest/dill.html#dill._dill.dumps and it's not clear to me that `f.write(dill.dumps(x)) == dill.dump(x, f)` – Mittenchops Feb 05 '20 at 03:22
  • https://github.com/uqfoundation/dill/blob/master/dill/_dill.py#L253-L266 – Mittenchops Feb 05 '20 at 03:25
  • @Mittenchops as you see, `dumps` uses `dump` to write data to a `StringIO` object, which acts like an in-memory file (https://stackoverflow.com/q/44672524/9609843). Thus its content should be equal to what gets written to a regular file. – sanyassh Feb 05 '20 at 08:38
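
A quick way to check the equivalence being discussed, as a hypothetical snippet with a made-up value and default dill settings:

import io
import dill

x = {"a": [1, 2, 3]}

buf = io.BytesIO()
dill.dump(x, buf)                        # pickle into an in-memory file
assert buf.getvalue() == dill.dumps(x)   # same bytes either way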