We have some code along these lines:

aiocb* aiocbptr = new aiocb;
// populate aiocbptr with info for the write
aio_write( aiocbptr );

// Then do this periodically:
if(aio_error( aiocbptr ) == 0) {
    delete aiocbptr;
}

aio_error is meant to return 0 when the write is completed, and hence we assume that we can call delete on aiocbptr at this point.

This mostly seems to work OK, but we recently started experiencing random crashes. The evidence points to the data pointed to by aiocbptr being modified after the call to delete.

Is there any issue using aio_error to poll for aio_write completion like this? Is there a guarantee that the aiocb will not be modified after aio_error has returned 0?

This change seems to indicate that something may since have been fixed in aio_error. We are running on x86 RHEL7 Linux with glibc 2.17, which predates this fix.

We tried using aio_suspend in addition to aio_error: once aio_error has returned 0, we call aio_suspend, which is meant to wait for the operation to complete. Since the operation should already have completed, aio_suspend should do nothing. However, adding it seemed to fix the crashes.
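
Roughly, the workaround looks like the sketch below. This is a minimal illustration and not our production code: the helper name write_and_reap, the 1 ms sleep between polls, and the extra aio_return call are assumptions made for the example. It polls aio_error until the status is no longer EINPROGRESS, calls aio_suspend (which should return immediately at that point), and only then frees the aiocb.

```cpp
#include <aio.h>
#include <cassert>
#include <cerrno>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper (name and structure are illustrative only): starts one
// asynchronous write at offset 0, polls until it finishes, then synchronizes
// with the AIO implementation before freeing the aiocb.
// Returns the byte count reported by aio_return, or -1 on failure.
ssize_t write_and_reap(int fd, const char* data, size_t len) {
    assert(data != nullptr);

    aiocb* cb = new aiocb;
    std::memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf = const_cast<char*>(data);
    cb->aio_nbytes = len;
    cb->aio_offset = 0;

    if (aio_write(cb) != 0) {
        delete cb;
        return -1;
    }

    // Poll the error status; real code would do other work between checks.
    while (aio_error(cb) == EINPROGRESS) {
        usleep(1000);
    }

    // Workaround: the request is already finished, so this should not block,
    // but the call makes this thread synchronize with the AIO implementation.
    const aiocb* const list[1] = { cb };
    aio_suspend(list, 1, nullptr);

    // Collect the final status before releasing the control block.
    ssize_t result = aio_return(cb);
    delete cb;
    return result;
}
```

On older glibc versions this needs to be linked with -lrt. Calling it as write_and_reap(fd, "hello", 5) on a writable descriptor should return 5 if the write succeeds.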

  • A busy wait kind of defeats the entire point of using aio... – Shawn Jan 19 '23 at 19:34
  • Updated it to be more like what we do - polling aio_error occasionally – Dave Poston Jan 19 '23 at 23:11
  • From https://pubs.opengroup.org/onlinepubs/9699919799/ : `The aiocb structure and the data buffers associated with the asynchronous I/O operation are being used by the system for asynchronous I/O while, and only while, the error status of the asynchronous operation is equal to [EINPROGRESS]`. Your code is correct. To answer `Can aio_error be used to poll...?` yes, that's what it is for. Dumb idea, try adding `aio_return` or `aio_cancel` before the call to `delete`. – KamilCuk Jan 19 '23 at 23:59
  • Well, as I said, adding aio_suspend before the call to delete fixes it, so calling aio_return or aio_cancel would probably fix it too. Looking at the fix to aio_error, it seems like there might be a memory-ordering race bug in older versions of libc. – Dave Poston Jan 20 '23 at 08:49

1 Answer

Yes, my commit was fixing a missing memory barrier. Using e.g. aio_suspend triggers the memory barrier and thus fixes it too.
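
As a generic illustration of why a barrier matters when one thread polls a completion flag, here is a minimal sketch of the release/acquire pattern in standard C++. It is not glibc's actual code: the Request struct, its fields, and run_request are hypothetical names invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for an asynchronous request; not glibc's real layout.
struct Request {
    int result = 0;                 // plain data written by the worker thread
    std::atomic<bool> done{false};  // completion flag polled by the caller
};

// Runs one fake request on a worker thread and returns the result that the
// polling thread observes after it sees the completion flag.
int run_request() {
    Request req;

    std::thread worker([&req] {
        req.result = 42;                                  // 1. write the data
        req.done.store(true, std::memory_order_release);  // 2. then publish
    });

    // The acquire load pairs with the release store above: once it reads
    // true, every write the worker made before the store is visible here.
    while (!req.done.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }

    int observed = req.result;
    assert(observed == 42);  // guaranteed by the release/acquire pairing

    worker.join();
    return observed;
}
```

If the flag were a plain variable read without any synchronization, nothing in the language would order the flag against the data. Hardware store ordering on x86 does not constrain the compiler, which may still reorder or cache plain accesses.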

  • Do you happen to remember how the data race would break? Seems the writer did write(aiocb->data), mutex lock, aiocb->status=COMPLETE, and somehow my thread witnessed aiocb->status == COMPLETE followed by write(aiocb->data)? From what I've read on x86 architecture it seems store-store re-ordering is not allowed, so I would agree with the original code in thinking the mutex lock was not necessary! – Dave Poston Feb 01 '23 at 09:04