While I cannot speak for the upper layers, I can describe what happens at the hardware level, which may give you some insight for your design.
Whatever the upper layers of the stack are doing, in the end every operation has to be handled by the transceiver chip.
BLE operates over 40 channels in the 2.4 GHz band, of which 3 are used for advertising (broadcast) and the other 37 for data transmission. This allows a multitude of devices to communicate at once while limiting collisions, because they are spread over different frequencies.
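To make the channel layout concrete, here is a small sketch that maps a BLE channel index to its centre frequency. The function and constant names are my own, not from any library; the frequencies follow the usual 2 MHz spacing from 2402 MHz to 2480 MHz, with the three advertising channels (37, 38, 39) placed at 2402, 2426 and 2480 MHz.

```python
# Advertising channel index -> centre frequency in MHz.
ADVERTISING_CHANNELS = {37: 2402, 38: 2426, 39: 2480}


def channel_frequency_mhz(index: int) -> int:
    """Return the centre frequency (MHz) of a BLE channel index (0..39).

    Hypothetical helper written for illustration only.
    """
    if index in ADVERTISING_CHANNELS:
        return ADVERTISING_CHANNELS[index]
    if 0 <= index <= 10:
        # Data channels 0..10 sit between advertising channels 37 and 38.
        return 2404 + 2 * index
    if 11 <= index <= 36:
        # Data channels 11..36 sit between advertising channels 38 and 39.
        return 2428 + 2 * (index - 11)
    raise ValueError(f"BLE channel index out of range: {index}")


if __name__ == "__main__":
    for ch in (0, 10, 11, 36, 37, 38, 39):
        print(ch, channel_frequency_mhz(ch), "MHz")
```

Note how the advertising channels are spread across the band rather than grouped together, so that a single interferer is unlikely to block all three.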
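The data channels are not picked freely for each packet: a connection hops across them, and adaptive frequency hopping lets the channels with too much noise (or traffic) be excluded from the channel map.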
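As an illustration of how hopping and channel exclusion fit together, below is a sketch of the older and simpler channel selection scheme (Channel Selection Algorithm #1), as I understand it from the Bluetooth Core Specification. The function name and the list-of-booleans channel map are my own choices for the example; a real controller does this in firmware, and newer devices may use a different algorithm.

```python
from typing import List, Tuple

NUM_DATA_CHANNELS = 37


def next_data_channel(last_unmapped: int, hop_increment: int,
                      channel_map: List[bool]) -> Tuple[int, int]:
    """Return (unmapped_channel, channel_actually_used) for the next event.

    channel_map[i] is True when data channel i is considered usable.
    Hypothetical helper, for illustration only.
    """
    if len(channel_map) != NUM_DATA_CHANNELS:
        raise ValueError("channel map must cover the 37 data channels")
    used = [ch for ch, ok in enumerate(channel_map) if ok]
    if not used:
        raise ValueError("channel map has no usable channel")

    # Plain hop: add the hop increment, wrap around the 37 data channels.
    unmapped = (last_unmapped + hop_increment) % NUM_DATA_CHANNELS
    if channel_map[unmapped]:
        return unmapped, unmapped

    # The hop landed on an excluded channel: remap onto the usable ones.
    remapped = used[unmapped % len(used)]
    return unmapped, remapped


if __name__ == "__main__":
    channel_map = [True] * NUM_DATA_CHANNELS
    channel_map[7] = False  # pretend channel 7 is too noisy
    unmapped = 0
    for _ in range(5):
        unmapped, channel = next_data_channel(unmapped, 7, channel_map)
        print(unmapped, "->", channel)
```

The point for your design is that the channel for each connection event is fully determined by the link layer, not by the application.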
The transceiver itself is only able to communicate (speak or listen) on one channel at a time and has to switch channels to reach other devices. This relies on very tight timing of the communication.
Another fact is that a wireless transceiver is essentially half duplex: it cannot send and listen at the same time, and two devices emitting at the same time on the same channel will interfere with each other. It is therefore by design (and by the laws of nature) serial, or sequential.
If you implement some sort of operation queue or threaded implementation, in the end everything will still have to be processed serially / sequentially by the transceiver.
If you access it from different threads, the transceiver may have to jump between channels all the time, or may even get confused if this is not handled well at the upper level.
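If you do end up with several threads, one way to keep the radio access orderly is to funnel every request through a single worker. The sketch below shows the idea with Python's standard queue and threading modules. FakeTransceiver, SerialRadioQueue and run_demo are hypothetical names made up for this example; the fake radio only records what it was asked to do and counts any overlapping access.

```python
import queue
import threading


class FakeTransceiver:
    """Hypothetical stand-in for the radio; it only records usage."""

    def __init__(self):
        self._busy = threading.Lock()
        self.overlaps = 0   # times a caller found the radio already in use
        self.log = []       # (device, payload) in the order they were sent

    def exchange(self, device, payload):
        if not self._busy.acquire(blocking=False):
            self.overlaps += 1
            self._busy.acquire()
        try:
            self.log.append((device, payload))
            return f"ack:{device}:{payload}"
        finally:
            self._busy.release()


class SerialRadioQueue:
    """Funnels requests from any thread through one worker thread."""

    def __init__(self, radio):
        self._radio = radio
        self._requests = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, device, payload):
        """Queue one exchange and block until the worker has done it."""
        done = threading.Event()
        slot = {}
        self._requests.put((device, payload, slot, done))
        done.wait()
        if "error" in slot:
            raise slot["error"]
        return slot["reply"]

    def _run(self):
        while True:
            device, payload, slot, done = self._requests.get()
            try:
                slot["reply"] = self._radio.exchange(device, payload)
            except Exception as exc:  # hand the failure back to the caller
                slot["error"] = exc
            finally:
                done.set()


def run_demo(n_threads=4, per_thread=5):
    """Several threads talk to different slaves through one queue."""
    radio = FakeTransceiver()
    radio_queue = SerialRadioQueue(radio)

    def talk(device):
        for i in range(per_thread):
            radio_queue.submit(device, i)

    threads = [threading.Thread(target=talk, args=(f"slave{n}",))
               for n in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return radio


if __name__ == "__main__":
    radio = run_demo()
    print("exchanges:", len(radio.log), "overlaps:", radio.overlaps)
```

The threads still exist, but only the worker ever touches the radio, so the hardware sees the same strictly sequential traffic it would see from a plain serial loop.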
The only good reason I can see to handle this with threads would be if the processing time of the transceiver is significantly lower than that of the upper stack you have to run, so that you could take advantage of a multi-core processor.
Otherwise, unless there is a very specific software need or architecture, I do not believe you will gain much from anything other than a serial implementation. I would also talk to the slaves one by one rather than all at the same time, for the reasons explained above.