TL;DR: I need to detect at runtime whether a function reads from the same memory address more than once.

Context

I have a library written in C that sometimes validates input before processing it. Schematically:

if (!validate(input, input_size)) return ERROR;
return process_valid_input(input, input_size);

This is only correct because the content of input doesn't change between the two function calls. Otherwise process_valid_input might receive invalid input, and that would be bad.

Now I am using this library in a program that wants to call the library with buffers that are in shared memory. The memory may be shared with an untrusted program that could change the data between validation and processing. So the program must copy the input from the shared memory to trusted memory. But copying is expensive (in CPU time and, more importantly for us, in RAM), so I want to avoid it when it isn't necessary. Sometimes the library makes its own copies: it copies a few bytes, works on them, then copies the next few bytes and works on those, and so on.
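To make the "copy to trusted memory" fallback concrete, here is a minimal sketch. The bodies of validate and process_valid_input are hypothetical stand-ins (the real ones live in the library), and call_with_private_copy and the ERROR value are names made up for illustration. The point is only that the shared buffer is read once per byte while taking the snapshot, and everything afterwards works on the private copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define ERROR (-1) /* assumed error code, for illustration only */

/* Hypothetical stand-in for the library's validator: rejects null bytes. */
static bool validate(const unsigned char *input, size_t input_size) {
    for (size_t i = 0; i < input_size; i++) {
        if (input[i] == 0) return false;
    }
    return true;
}

/* Hypothetical stand-in for the library's processing step: sums the bytes. */
static int process_valid_input(const unsigned char *input, size_t input_size) {
    int sum = 0;
    for (size_t i = 0; i < input_size; i++) sum += input[i];
    return sum;
}

/* Snapshot the shared buffer, then validate and process only the snapshot. */
int call_with_private_copy(const volatile unsigned char *shared, size_t size) {
    assert(shared != NULL || size == 0);
    unsigned char *copy = malloc(size > 0 ? size : 1);
    if (copy == NULL) return ERROR;
    /* The volatile qualifier makes this exactly one read per shared byte. */
    for (size_t i = 0; i < size; i++) copy[i] = shared[i];
    int result = ERROR;
    if (validate(copy, size)) result = process_valid_input(copy, size);
    free(copy);
    return result;
}
```

This is the expensive path: it costs one allocation of input_size bytes per call, which is exactly what the read-once tests are meant to let me skip when the library function is already safe.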

Reading directly from shared memory is safe as long as the program only ever reads each individual byte once. In other words, it's an error if the program reads from the same address twice. In some cases the library itself is safe, in others it needs the program around it to make a copy. I want to write tests that enforce that the program is safe, i.e. that it never reads from the same shared memory address twice (within a given function call).

The program and the library are portable, but I do most of my testing on Linux, and the relevant behaviors are platform-independent so I only care about doing this testing on Linux. This needs to work with a stock Linux in user mode (on our CI, I can have root, but only inside a Docker container which is running on top of a kernel that's outside my control).

Question

How can I instrument my C program so that

BEGIN_READ_ONCE(input, input_size);   // for me to write
library_function(input, input_size);  // I can't change this
END_READ_ONCE(input, input_size);     // for me to write

will result in a runtime failure if library_function reads from the same address twice? I can't change the library code itself, only add custom code before and after calling the library. I can compile it with instrumentation or run it inside an unusual environment as long as this is doable in Linux user mode.

Example

This is a good function. It prints a sequence of non-zero characters, then stops.

void good(volatile char *input, size_t n) {
    for (size_t i = 0; i < n; i++) {
        char c = input[i]; // the only read of input[i]
        if (c == 0) break;
        putchar(c); // not supposed to print a null byte
    }
}

This is a bad function, because it reads each character twice. In the real world, it could print a null byte (and keep going) if the input is in shared memory and the value changes between the evaluation of input[i] != 0 and the execution of putchar(input[i]).

void bad(volatile char *input, size_t n) {
    for (size_t i = 0; i < n && input[i] != 0; i++) {
        putchar(input[i]); // not supposed to print a null byte
    }
}
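To pin down what "one read good, two reads bad" means, here is a small sketch that counts reads per offset. It is not a solution to the question, because it routes every read through a hypothetical accessor (tracked_read), which requires changing the code under test. The names tracked_read, reset_counts, count_double_reads, good_tracked and bad_tracked, and the 1024-byte limit, are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_TRACKED 1024 /* assumed upper bound on input size */

static unsigned read_count[MAX_TRACKED];

/* Forget all recorded reads. */
void reset_counts(void) {
    memset(read_count, 0, sizeof read_count);
}

/* Hypothetical accessor: record a read of offset i, then perform it. */
char tracked_read(const volatile char *buf, size_t i) {
    assert(i < MAX_TRACKED);
    read_count[i]++;
    return buf[i];
}

/* Number of offsets below n that were read more than once. */
size_t count_double_reads(size_t n) {
    size_t doubles = 0;
    for (size_t i = 0; i < n && i < MAX_TRACKED; i++) {
        if (read_count[i] > 1) doubles++;
    }
    return doubles;
}

/* One read per byte, which is what good is meant to do. */
void good_tracked(const volatile char *input, size_t n) {
    for (size_t i = 0; i < n; i++) {
        char c = tracked_read(input, i);
        if (c == 0) break;
        putchar(c);
    }
}

/* Same shape as bad: one read for the check, one for the output. */
void bad_tracked(const volatile char *input, size_t n) {
    for (size_t i = 0; i < n && tracked_read(input, i) != 0; i++) {
        putchar(tracked_read(input, i));
    }
}
```

A test would call reset_counts(), run the function, and fail if count_double_reads(n) is non-zero. What I'm looking for is a way to get the same verdict without touching the library's source.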

What I tried

Some things I thought of, but I don't know how to make them work:

  • Maybe I could mmap the memory in some weird way? But how can I detect reads? (I'm sure this is doable with a kernel module, but I can't use a kernel module in my test environment.)
  • Valgrind can notice reads, but how would I tell it “one read good, two reads bad”? I can't think of a way to get any of the checkers to work for me.
  • I could set a read watchpoint in gdb, but (at least on x86_64) this seems to be limited to very small sizes, which makes it impractical in my case (I need to handle inputs of about 1kB).
  • Maybe some profiling tools could help, but they would need to trigger on memory accesses and take exact measurements, without sampling or aggregating, since I need to know for sure whether the same address was accessed twice.
  • I could share the memory with a thread that keeps writing to the memory and hope to trigger a functional error inside the library, but that seems very unreliable.
Gilles 'SO- stop being evil'
  • Please explain why "Reading directly from shared memory is safe as long as the program only ever reads each individual byte once." Are all bytes (or atomically accessible units) independent from each other? Aren't there any non-atomic values or data structures that need to be consistent? What if process A reads from address X, then process B writes to X and X+1, then A continues reading at X+1? This could even happen inside `validate(input, input_size)` or `process_valid_input(input, input_size)`. Why do you need to allow access by "an untrusted program"? Looks like you need some locking. – Bodo Apr 18 '23 at 14:00
  • @Bodo As long as my program reads each byte (at most) once, it sees a consistent view of the shared memory, from its own point of view: it sees a particular content of the shared memory, which the other process is allowed to present. Inconsistencies can only arise if at time T1 my process knows that a particular byte (or bit or whatever) contains x, and at time T2 my process knows that it contains y≠x. The order of the reads doesn't matter. // Shared memory with an untrusted program is the overall problem I'm working on. Locking is impossible: I can't control what the untrusted program does. – Gilles 'SO- stop being evil' Apr 18 '23 at 18:18
  • I still cannot believe your assumption of ensuring data consistency by preventing multiple reads to the same address but allowing single read access to any address in any order. Example, assuming reading an input buffer is not atomic: T1 sender writes two atomic elements resulting in state `AB`, T2 receiver reads the 1st element and has state `A?`, T3 sender modifies the 1st element --> `XB`, T4 sender modifies the 2nd element --> `XY`, T5 receiver reads the 2nd element and gets state `AY` that was never present from the sender's point of view, and might not even be valid. – Bodo Apr 19 '23 at 14:45
  • @Bodo You're trying to map my context to a problem space you know well — but my problem space is a different one. For example, the fact that you talk about T1/T2/T3/T4 tells me you're thinking of threads, but in my world, the entities involved are processes (or virtual machines, or containers) which communicate but do not trust each other. If the sender does crazy things with the data in shared memory, they'll end up with a garbage response: that's their problem. My problem is, if the sender does crazy things, it must not affect the security of my program. – Gilles 'SO- stop being evil' Apr 19 '23 at 15:57
  • @Bodo While the real-world problem could be described as “data consistency”, it's not “data consistency” in the sense of distributed databases or caching. In any case, you don't really need to understand the context to answer the question. The reason I explain the context is to filter “out of the box” approximate solutions that relax the requirements. If you “cannot believe” that my context makes sense — just take the question at face value or ignore it. – Gilles 'SO- stop being evil' Apr 19 '23 at 15:59
  • T1..T5 are meant as ordered points in time, similar to your 1st comment, and `A` and `B` (and other letters) represent two data units/values that can be accessed atomically each but need two operations to read/write both. "sender" and "receiver" are the two processes/VMs/containers. Example: Exchanging an 8-byte value while a single CPU instruction can only access 4 bytes. Is this kind of consistency problem not relevant in your use case? (I have experience with a real-world shared-memory interface between two independent CPUs of different architecture in an industrial embedded system.) – Bodo Apr 19 '23 at 17:06
  • @Bodo Atomicity is not relevant here, I only care about consistency _as seen from my process_, not about consistency in the system in general. If the other process that has access to the shared memory wants to shoot itself in the foot, that's its problem and I don't care. I need to ensure that if the other process shoots, _my_ foot doesn't get hurt. I've just added a good/bad example, if the problem statement wasn't clear. – Gilles 'SO- stop being evil' Apr 19 '23 at 17:09
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/253241/discussion-between-bodo-and-gilles-so-stop-being-evil). – Bodo Apr 19 '23 at 17:10
  • @Bodo The problem here is that between the first read from untrusted memory where the check is performed, and the second read where the act is done, the untrusted code can modify the contents of the buffer being read. ISTR several CVEs against LSM (AppArmor, maybe?) where there were vulnerabilities from doing a double read - LSM reads filename from user space to check access permissions, process changes the user-space buffer, then the actual system call reads a different filename from the same buffer. – Andrew Henle Apr 19 '23 at 17:23
  • As I understand now, you assume you already verified that all possible problems caused by inconsistent data are sufficiently handled by your library already, and only additional problems caused by possible double read from shared memory need to be tested. Valgrind, AFAIK checks access to heap arrays only, and it needs to detect invalid access only. Vague ideas how this might be possible: `mprotect` with dirty signal handler tricks, see https://stackoverflow.com/a/2663575/10622916, implementing a Valgrind tool plug-in, see https://valgrind.org/docs/manual/writing-tools.html. Maybe code review. – Bodo Apr 19 '23 at 19:09
  • I think you meant to write `putchar(input[i]);` in your bad function example instead of `putchar(c);`. – Marco Jul 17 '23 at 18:30

0 Answers