vectorized strlen getting away with reading unallocated memory

Question

While studying OSX 10.9.4's implementation of strlen, I notice that it always compares a chunk of 16-bytes and skips ahead to the following 16-bytes until it encounters a '\0'. The relevant part:

3de0:   48 83 c7 10             add    $0x10,%rdi
3de4:   66 0f ef c0             pxor   %xmm0,%xmm0
3de8:   66 0f 74 07             pcmpeqb (%rdi),%xmm0
3dec:   66 0f d7 f0             pmovmskb %xmm0,%esi
3df0:   85 f6                   test   %esi,%esi
3df2:   74 ec                   je     3de0 <__platform_strlen+0x40>

0x10 is 16 bytes in hex.

When I saw that, I was wondering: this memory could just as well not be allocated. If I had allocated a C string of 20 bytes and passed it to strlen, it would read 36 bytes of memory. Why is it allowed to do that? I started looking and found How dangerous is it to access an array out of bounds?

Which confirmed that it's definitely not always a good thing, unallocated memory might be unmapped, for example. Yet, there must be something that makes this work. Some of my hypotheses:

OSX not only guarantees that its allocations are 16-byte aligned, but also that the "quantum" of an allocated is a 16-byte chunks. Said another way, allocating 5 bytes will actually allocate 16 bytes. Allocating 20 bytes will actually allocate 32 bytes.
It's not harmful per se to read of the end of an array when you're writing asm, as it's not undefined behaviour, as long as its within bounds (within a page?).

What's the actual reason?

EDIT: just found Why I'm getting read and write permission on unallocated memory?, which seems to indicate my first guess was right.

EDIT 2: Stupidly enough, I had forgotten that even though Apple seems to have removed the source of most of its asm implementations (Where did OSX's x86-64 assembly libc routines go?), it left strlen: http://www.opensource.apple.com/source/Libc/Libc-997.90.3/x86_64/string/strlen.s

In the comments we find:

//  returns the length of the string s (i.e. the distance in bytes from
//  s to the first NUL byte following s).  We look for NUL bytes using
//  pcmpeqb on 16-byte aligned blocks.  Although this may read past the
//  end of the string, because all access is aligned, it will never
//  read past the end of the string across a page boundary, or even
//  accross a cacheline.

EDIT: I honestly think all answerers deserved an accepted answer, and basically all contained the information necessary to understand the issue. So I went for the answer of the person that had the least reputation.

essentially a duplicate of [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/q/37800739) - definitely safe in asm for aligned loads, but technically UB in C. — Peter Cordes, Jun 06 '20 at 17:20

score 16 · Answer 1 · answered Aug 29 '14 at 18:30

I'm the author of the routine in question.

As some others have said, the key thing is that the reads are all aligned. While reading outside the bounds of an array is undefined behavior in C, we're not writing C; we know lots of details of the x86 architecture beyond what the C abstract machine defines.

In particular, reads beyond the end of a buffer are safe (meaning they cannot produce a trap or other observable side effect) so long as they do not cross a page boundary (because memory attributes and mappings are tracked at page granularity). Since the smallest supported page size is 4096 bytes, an aligned 16 byte load cannot cross a page boundary.

This is indeed a great summary of what the other posters have said. Thanks a lot for commenting! And I must also say that I quite like reading your handiwork, great stuff. I'm supposing you can't tell me anything about http://stackoverflow.com/questions/25505371/where-did-osxs-x86-64-assembly-libc-routines-go otherwise you would've already indicated it. No biggie though, dumping the libsystem.dylib works fine for me. Looking forward to taking a gander at the OSX 10.10 implementations! — Aktau, Aug 30 '14 at 11:12

score 6 · Accepted Answer · answered Aug 29 '14 at 11:52

Reading memory on most architectures only has a side effect if the address being read corresponds to a page that is not mapped. Most strlen implementations for modern computers try to do only aligned reads of however-many bytes. They will never do a 16-byte read straddling two pages, and so they will never elicit any side effect. So it's cool.

score 4 · Answer 3 · answered Aug 29 '14 at 11:37

4

How malloc aligns stuff is irrelevant, since the programmer may allocate a string inside a block. A simple example is a struct that has an embedded char array:

struct Foo
{
    int bar;
    char baz[10];
};

If you allocate an instance of this struct, it will take up 16 bytes, but baz will start at offset 4. Thus if you read 16 bytes from there, you will cross into the next 16 byte chunk that you don't own. If you are unlucky, that may lie in the next page and trigger a fault.

Also, strings don't have to be in the heap at all, such as constants that are in read-only data section. strlen must work in all cases.

I assume the strlen function first processes the initial portion of the string until it's 16 byte aligned (this code has been omitted from the question) and then proceeds in 16 byte chunks. As such, the actual reason this works is reason #2: you won't cross a page boundary which is the granularity of access checking by the processor.

answered Aug 29 '14 at 11:37

Jester

56,577
4
81
125

I linked to the full implementation (try clicking on the `strlen` in the first phrase of my question). It does indeed have such a prologue, but that's neither here nor there, and not the important part. You're quite right that it's not necessarily a malloc'ed piece of memory being strlen'ed. So that, indeed, indicates that #2 might be the reason. – Aktau Aug 29 '14 at 11:40
By the way, I just remembered that `strlen` is one of the few implementations in OSX 10.9.4 we do have the source for: http://www.opensource.apple.com/source/Libc/Libc-997.90.3/x86_64/string/strlen.s – Aktau Aug 29 '14 at 11:44
@Aktau: The prologue for unaligned strings *is* the important part; it's what makes the 16-byte access OK. – tmyklebu Aug 29 '14 at 11:54
where `baz` will start is subject to alignment (either [attribute](https://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Type-Attributes.html) or [#pragma pack](https://gcc.gnu.org/onlinedocs/gcc/Structure-Packing-Pragmas.html) ). – Remus Rusanu Aug 29 '14 at 12:00
@tmyklebu I thought it was implied that there was such an alignment prologue, otherwise those SSE instructions generate an exception I believe. I just showed the important main loop because it illustrated reading further than the end of the string. Most of the SSE-code I've seen includes such a prologue, so I didn't deem it important to include (but did link to the full implementation). – Aktau Aug 29 '14 at 12:07
@Aktau: Yeah, they blow up. But "allocated" is a C-level concept that has no meaning to the implementation of `strlen`. – tmyklebu Aug 29 '14 at 12:11
@tmyklebu: indeed it is, and I have/had some notions about how it all relates to mapping virtual pages to real pages. But not what the OS protects against (and so on and so forth). Which is -- as I have now found out -- the real crux of this question. – Aktau Aug 29 '14 at 12:13

Remus Rusanu · Answer 4 · 2014-08-29T10:31:36.110

2

Allocating Small Memory Blocks Using Malloc

...

When allocating any small blocks of memory, remember that the granularity for blocks allocated by the malloc library is 16 bytes. Thus, the smallest block of memory you can allocate is 16 bytes and any blocks larger than that are a multiple of 16. For example, if you call malloc and ask for 4 bytes, it returns a block whose size is 16 bytes; if you request 24 bytes, it returns a block whose size is 32 bytes. Because of this granularity, you should design your data structures carefully and try to make them multiples of 16 bytes whenever possible.

For the record, referencing (reading) past allocation could trigger a page fault if you cross a page boundary. This is how guardmalloc works:

Each malloc allocation is placed on its own virtual memory page (or pages). By default, the returned address for the allocation is positioned such that the end of the allocated buffer is at the end of the last page, and the next page after that is kept unallocated. Thus, accesses beyond the end of the buffer cause a bad access error immediately.

Also read the vectorized instructions explicit reference in the same page:

As of Mac OS X 10.5, libgmalloc aligns the start of allocated buffers on 16-byte boundaries by default, to allow proper use of vector instructions (e.g., SSE). (The use of vector instructions is common, including in some Mac OS X system libraries.

ps. afaik NSObject and friends share the heap implementation with malloc

edited Aug 29 '14 at 10:31

answered Aug 29 '14 at 10:24

Remus Rusanu

288,378
40
442
569

Thanks for the answer! Do you have any idea about how this is done on other platforms? Could the same strlen work on Windows and/or Linux (glibc)? – Aktau Aug 29 '14 at 11:36
No. For example [HeapAlloc](http://msdn.microsoft.com/en-us/library/windows/desktop/aa366597(v=vs.85).aspx) does not guarantees any alignment, but is well know that 32bit Windows returns an 8 byte aligned address, not 16 as SSE would need. – Remus Rusanu Aug 29 '14 at 11:56
1

Yet the same strlen would work because of the prologue right? When looking at the commented version, they even show what I suspected from a first read: they makes out the bits that can't belong to the string (before the start address passed in as an argument). – Aktau Aug 29 '14 at 12:05
Obviously, code that is ready to handle unaligned input (via a prologue) will work on any address. – Remus Rusanu Aug 29 '14 at 12:09
Yes, though I was already familiar with that and didn't inquire about it. I was wondering more about reading off the end of the string. – Aktau Aug 29 '14 at 12:10
I don't think there is anything 'special' about the strlen handling of the input tail. consider you approach the page end, either `\0` is in the last 16 bytes block, in which case is OK to deference the entire 16 bytes block, or `\0` is not found, next block is read and the page is crossed and fault is triggered, which would had happen also on a byte-by-byte implementation. – Remus Rusanu Aug 29 '14 at 12:14
1

Well, this is exactly what this question was enquiring about. From the implementation it's clear that it's totally ok to read aligned 16-bytes at a time. I was just wondering why/how. By just saying that there's nothing special about it, well that's kind of like mathematicians saying the steps of the derivation are trivial and thus shall be omitted. (one particularly egregious example being Trefethen and Bau's analysis of the Conjugate Gradients method, many parts of which were not trivial to me, yet marked as such). – Aktau Aug 29 '14 at 12:18
1

The gist of it is an aligned 16 bytes block cannot span a two pages. You have to think in hex. One ends with `0`, the other with `000`. Obviously a page boundary is always also a 16 bytes block boundary. – Remus Rusanu Aug 29 '14 at 12:24
That's the short of it indeed: 1. A 16-byte aligned block cannot cross a page boundary 2. The OS only really reacts when reading from a page that hasn't been mapped for the reading process. The page in which the string resides is always mapped of course, so strlen can't do anything wrong by reading 16-byte aligned blocks. Correct me if I've aggregated the information graciously provided by the different posters in the wrong way. – Aktau Aug 29 '14 at 12:27
@tmyklebu my memory is a bit fuzzy, it might've also been in the GMRES analysis. But I distinctly remember spending about 2 hours of time I didn't have before the exams deriving something that was "trivial" according to T. and B., and it was like 13 steps of non-trivial things. At least they were non-trivial to me, perhaps the numerical math force is just not strong enough with me (though I did end up getting 18/20 on that course, which qualifies as acing it, to me). – Aktau Aug 30 '14 at 11:15

score 2 · Answer 5 · answered Aug 29 '14 at 12:00

Note that it is an aligned read (implied by it being part of a non-vex-encoded instruction that isn't explicitly an unaligned read). That means that while it may (and often does) read beyond the end of the string, it will always stay on a page that the string is on.

vectorized strlen getting away with reading unallocated memory

5 Answers5

Linked