
My program will concurrently download about 10 million pieces of data with aiohttp and then write the data to about 4000 files on disk.

I use the aiofiles library because I want my program to be able to do other things while it is reading from or writing to a file.

But I worry that if the program tries to write to all 4000 files at the same time, the hard disk won't be able to keep up with the writes.

Is it possible to limit the number of concurrent writes with aiofiles (or another library)? Does aiofiles already do this?

Thanks.

Test code:

import aiofiles
import asyncio


async def write_to_disk(fname):
    async with aiofiles.open(fname, "w+") as f:
        await f.write("asdf")


async def main():
    # Start all ten writes at once; nothing limits their concurrency.
    tasks = [asyncio.create_task(write_to_disk("%d.txt" % i))
             for i in range(10)]
    await asyncio.gather(*tasks)


asyncio.run(main())
Duh Huh

1 Answer


You can use asyncio.Semaphore to limit the number of concurrent tasks. Simply force your write_to_disk function to acquire the semaphore before writing:

import aiofiles
import asyncio


async def write_to_disk(fname, sema):
    # Edit to address comment: acquire semaphore after opening file
    async with aiofiles.open(fname, "w+") as f, sema:
        print("Writing", fname)
        await f.write("asdf")
        print("Done writing", fname)


async def main():
    sema = asyncio.Semaphore(3)  # Allow 3 concurrent writers
    tasks = [asyncio.create_task(write_to_disk("%d.txt" % i, sema)) for i in range(10)]
    await asyncio.gather(*tasks)


asyncio.run(main())

Note both the sema = asyncio.Semaphore(3) line and the addition of sema to the async with statement.

Output:

"""
Writing 1.txt
Writing 0.txt
Writing 2.txt
Done writing 1.txt
Done writing 0.txt
Done writing 2.txt
Writing 3.txt
Writing 4.txt
Writing 5.txt
Done writing 3.txt
Done writing 4.txt
Done writing 5.txt
Writing 6.txt
Writing 7.txt
Writing 8.txt
Done writing 6.txt
Done writing 7.txt
Done writing 8.txt
Writing 9.txt
Done writing 9.txt
"""
ParkerD
  • When one of the file objects is buffering, another file object can write without hurting performance too much. However, if I add a semaphore to the f.write, would that block even when the write is just buffering instead of actually writing to the disk? The data is JSON. – Duh Huh Dec 13 '19 at 18:20
  • @DuhHuh Just updated my answer: moving the semaphore after the `.open` in the `async with` ensures the files will all be opened first, but then the concurrent writes to them are limited. – ParkerD Dec 13 '19 at 18:36
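
Following up on the comment thread: if the goal is to hold the semaphore only while a write is actually in flight, rather than pairing it with the open in one statement, the acquisition can be nested explicitly. A minimal sketch of that variant, assuming the same write_to_disk signature as in the answer:

async def write_to_disk(fname, sema):
    async with aiofiles.open(fname, "w+") as f:
        # The file is opened without waiting; only the write waits its turn.
        async with sema:
            await f.write("asdf")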