
Lately I've been playing around a little with the Python 3 async features. Overall I'm quite happy with the 3.6 syntax and, of course, the performance boost you gain. In my opinion, one of the exciting projects evolving around the ASGI standard is starlette. I've got a sample app running where I'm reading data from an HDF5 file. h5py does not support asynchronous I/O yet, which leaves me with the question: does what I'm doing here make any sense at all? To my understanding this code runs synchronously after all. What is the recommended way to do I/O in async contexts?

async def _flow(indexes):
    print('received flow indexes %s ' %indexes)
    # uses h5py under the hood
    gr = GridH5ResultAdmin(gridadmin_f, results_f)
    t = gr.nodes.timeseries(indexes=indexes)
    data = t.only('s1').data
    # data is a numpy array
    return data['s1'].tolist()

@app.route('/flow_velocity')
async def flow_results(request):

    indexes_list = [[2,3,4,5], [6,7,8,9], [10,11,12,13]]

    tasks = []
    loop = asyncio.get_event_loop()
    t0 = datetime.datetime.now()
    for indexes in indexes_list:
        print('Start getting indexes %s' % indexes)
        # Launch a coroutine for each data fetch
        task = loop.create_task(_flow(indexes))
        tasks.append(task)

    # Wait on, and then gather, all data
    flow_data = await asyncio.gather(*tasks)
    dt = (datetime.datetime.now() - t0).total_seconds()
    print('elapsed time: {} [s]'.format(dt))

    return JSONResponse({'flow_velocity': flow_data})

Logging:

INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Start getting indexes "[2, 3, 4, 5]"
Start getting indexes "[6, 7, 8, 9]"
Start getting indexes "[10, 11, 12, 13]"
received flow indexes [2, 3, 4, 5] 
received flow indexes [6, 7, 8, 9] 
received flow indexes [10, 11, 12, 13]
elapsed time: 1.49779 [s]
LarsVegas

1 Answer


Unfortunately, you can't use asyncio with the h5py module. What you are doing here is essentially sequential: if the I/O part can't be done asynchronously, the rest of your async code doesn't gain you much.
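A minimal sketch of why this happens, using `time.sleep` as a stand-in for the blocking h5py read. The names `blocking_fetch` and `DELAY` are made up for illustration, and `asyncio.run` assumes Python 3.7 or newer. A coroutine that never awaits anything never hands control back to the event loop, so `asyncio.gather` ends up running the tasks back to back:

```python
import asyncio
import time

DELAY = 0.2  # hypothetical duration of one blocking read, in seconds


async def blocking_fetch(indexes, log):
    # There is no await in this coroutine, so the event loop
    # cannot switch to another task until it returns.
    log.append(('start', tuple(indexes)))
    time.sleep(DELAY)  # stand-in for the synchronous h5py call
    log.append(('end', tuple(indexes)))
    return indexes


async def main():
    log = []
    t0 = time.perf_counter()
    results = await asyncio.gather(
        blocking_fetch([2, 3, 4, 5], log),
        blocking_fetch([6, 7, 8, 9], log),
        blocking_fetch([10, 11, 12, 13], log),
    )
    elapsed = time.perf_counter() - t0
    return results, log, elapsed


results, log, elapsed = asyncio.run(main())
print(log)  # each 'start' is followed by its own 'end'
print('elapsed: %.2f s' % elapsed)
```

The elapsed time comes out at roughly three times `DELAY`, which is the same pattern as in the question: the tasks are created up front, but each one runs to completion before the next one starts.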

https://github.com/h5py/h5py/issues/837

Summary from that thread

So there are two separate issues with adding asyncio support:

  1. asyncio explicitly does not support filesystem I/O at this time, see e.g. https://github.com/python/asyncio/wiki/ThirdParty#filesystem, https://groups.google.com/forum/#!topic/python-tulip/MvpkQeetWZA, What is the status of POSIX asynchronous I/O (AIO)?, and https://github.com/Tinche/aiofiles which is the closest to what you'd want.
  2. All I/O is done through HDF5 (the library), so whatever async support you'd want to add would need support in HDF5 (the library)

This basically means that h5py is unlikely to ever support asyncio.

You could try running things in a thread, though there are no guarantees it will work well: as I mentioned, HDF5 controls the I/O, and you will want to make sure you don't run into any of its locking controls. You will probably want to understand which file mode mentioned at http://docs.h5py.org/en/latest/high/file.html#file-drivers will work best for you. You could also consider alternatives such as multiprocessing or concurrent.futures.
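As a rough sketch of the thread approach, the blocking call can be handed to a `concurrent.futures.ThreadPoolExecutor` through `loop.run_in_executor`. Here `read_timeseries` is a hypothetical placeholder that only sleeps; it is not the real h5py-backed read, and `asyncio.run` / `asyncio.get_running_loop` assume Python 3.7 or newer:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.3  # hypothetical duration of one blocking read, in seconds


def read_timeseries(indexes):
    # Placeholder for the synchronous, h5py-backed read from the question.
    time.sleep(DELAY)
    return [float(i) for i in indexes]


async def fetch_all(indexes_list):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=len(indexes_list)) as pool:
        # Each blocking call runs in a worker thread; the event loop
        # only awaits the resulting futures.
        futures = [
            loop.run_in_executor(pool, read_timeseries, indexes)
            for indexes in indexes_list
        ]
        return await asyncio.gather(*futures)


t0 = time.perf_counter()
flow_data = asyncio.run(fetch_all([[2, 3, 4, 5], [6, 7, 8, 9], [10, 11, 12, 13]]))
elapsed = time.perf_counter() - t0
print(flow_data)
print('elapsed: %.2f s' % elapsed)
```

With the sleeping placeholder the three calls overlap, so the total is close to one `DELAY` rather than three. Whether real h5py reads overlap in the same way depends on how h5py and HDF5 handle locking, so this is worth measuring on your own files. Swapping in `ProcessPoolExecutor` is the multiprocessing variant of the same pattern, under the assumption that each worker process opens the file itself.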

Tarun Lalwani
  • You can pass h5py a Python file-like object and then implement asyncio at the level of that object (implement read, write, truncate, etc.). I've got an example of that working (with much effort), but I think I may be running into the h5 locking mechanisms you mention here, because things appear to run nearly sequentially, even though the same code with raw `.read()` calls on the file-like object runs extremely quickly: 1.5 GB/sec random-seek ingest using asyncio (w/ 20 event loop instances) from a local cluster S3 interface. – David Parks Aug 01 '19 at 04:33
  • There has been some progress in HDF5 space regarding asynchronous I/O: https://hdf5-vol-async.readthedocs.io/en/latest/ , but no Python binding I'm aware of. – Filip Brzek Jan 17 '22 at 10:45