
UPDATE: the code below DOES work to dereference the pointer. I had incorrectly inserted some lines at the entry point that had the effect of overwriting the memory location f1_ptr. The important part is that to dereference a pointer stored in a memory location, use: mov r15,qword[f1_ptr] / mov rdx,qword[r15+0]. Load the pointer from memory into r15, then load from the address in r15 into rdx. That does it. But as Peter Cordes explains below, static memory locations are not reentrant or thread-safe, so it's best to keep pointers in registers where possible.
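
For reference, a minimal self-contained sketch of the pattern (illustrative only -- the function name, section layout, and default rel directive are not from the actual module):

default rel                        ; use RIP-relative addressing for f1_ptr

section .bss
f1_ptr: resq 1                     ; static qword holding the saved pointer (not reentrant)

section .text
global get_first_char
get_first_char:                    ; hypothetical export; rcx = pointer to the array of string pointers
push r15                           ; r15 is call-preserved, so save the caller's value
mov [f1_ptr],rcx                   ; store the incoming pointer on entry
mov r15,qword[f1_ptr]              ; later: reload the saved pointer from memory
mov rdx,qword[r15+0]               ; rdx = first 8-byte entry = address of the first name string
movsx eax,BYTE[rdx+0]              ; eax = first byte of that string
pop r15                            ; restore the caller's r15
ret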

****End of Update****

I am using ctypes to pass a pointer to an array of pointers; each pointer points to the start of a string in a list of names. In the Windows x64 ABI, that pointer is passed as the first parameter, in rcx.

On entry to the program, I ordinarily put pointers into memory variables, because I can't keep them in volatile registers like rcx and rdx without them being overwritten; in this case the pointer is stored with mov [f1_ptr],rcx. But later in the program, when I move the value from memory back into a register, it doesn't work. In other work with simple pointers (not pointers to an array of pointers), I have no problem.

Based on the answer to an earlier question (Python ctypes how to read a byte from a character array passed to NASM), I found that IF I store rcx in another register on entry (e.g., r15), I can use that register freely later in the program with no problem. For example, to access the second byte of the second name string:

xor rax,rax               ; clear rax (not strictly needed before movsx, but harmless)
mov rdx,qword[r15+8]      ; rdx = second 8-byte entry = pointer to the second name string
movsx eax,BYTE[rdx+1]     ; eax = sign-extended second byte of that string
jmp label_900

If instead I mov r15,[f1_ptr] downstream in the program, that doesn't work. To emulate the code above:

xor rax,rax               ; clear rax (not strictly needed before movsx, but harmless)
mov r15,qword[f1_ptr]     ; reload the saved pointer from the memory variable
mov rdx,qword[r15+8]      ; rdx = second 8-byte entry = pointer to the second name string
movsx eax,BYTE[rdx+1]     ; eax = sign-extended second byte of that string
jmp label_900

but not only does it not work, it crashes.

So the question is: rcx is stored in memory on entry to the program. Later I read it back from memory into r15 and dereference it exactly as before. Why doesn't it work the same way?

The full code, minus the code segments shown above, is at the link I posted above.

RTC222
  • I think you mean "function" instead of "program". This is a Python module with native code, so it's a function called by the python interpreter, right? Note that using `f1_ptr` isn't re-entrant or thread-safe. That's why compilers spill to the stack, not static locations, if they run out of registers. Also note that `r15` is call-preserved, so you'd better make sure to restore the caller's `r15` before returning. (See the stack-spill sketch after this comment thread.) – Peter Cordes Jan 24 '19 at 19:28
  • You're right - a function returns a value, which this one does, so it's a function. In general, I prefer to use registers to hold all values. I appreciate your comment about memory locations not being reentrant or thread-safe, because that may be the whole issue here. In this case I think it would be best to use the register, but I'm curious to know why single pointers (not to an array) work in memory, or maybe that's just not a good idea in the long run? – RTC222 Jan 24 '19 at 19:41
  • People with a C background call everything a function, even if the return type is `void` (i.e. returns with no value). If you want to distinguish that, the right terminology is function vs. procedure or subroutine, not "program". – Peter Cordes Jan 24 '19 at 19:43
  • 40 years ago we referred to most everything as a "program" but now I think a program is just a listing of source code. I realize verbal precision in this matter is important. – RTC222 Jan 24 '19 at 19:46
  • Your question seems to be showing that keeping pointers in single named static storage locations *doesn't* work, the way you're using them. To find out if it's a good idea, ask yourself "in C would this need to be `static` or global?" If the answer is no, then it shouldn't be in hand-written asm either. Using the stack is usually more compact (smaller than an addressing mode that can address static data, on x86-64) and touches fewer cache lines. And yes of course using registers is much better. – Peter Cordes Jan 24 '19 at 19:47
  • I've been a serious Unix/Linux geek since about 1997, and a "program" is something you can run from the shell. Either the source or the executable for something that can run as a whole process. I don't think I've ever heard it used to refer to source-code listings specifically, as opposed to executables, and definitely not for the source code for just a function. (It was fairly clear what you meant in your question, from context of getting an arg in RCX, though, so this is mostly nit-picking) – Peter Cordes Jan 24 '19 at 19:52
  • It's sort of like how programming languages are classified as functional, imperative, etc. But the most important thing for me is that I learned something from you about storing pointers (not in memory), and the question was really out of curiosity, because I try to keep everything in registers -- memory is still not really that fast. – RTC222 Jan 24 '19 at 19:56
  • L1d cache is pretty damn fast (like 4 or 5 cycle load-use latency), but yeah it's still slower than registers. More importantly, it costs extra instructions / code-size to use. And weird microarchitectural effects can slow down loads sometimes, at least on older CPUs. [Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?](https://stackoverflow.com/q/54084992) shows one case where it looks like IvyBridge can't load from the same cache line in the same cycle it's committing a store. Not a problem on Skylake, though. – Peter Cordes Jan 24 '19 at 20:04
  • Anyway, re: this question: not a [mcve]. If the 8 bytes in `[f1_ptr]` really were the same as what you copied to `r15` on function entry, the snippets would be equivalent. The problem lies elsewhere, either something stepping on `f1_ptr` or you're clobbering your caller's `r15`. Use a debugger to find out *where* it crashes; either right there when you try to use the value, or a lot later? – Peter Cordes Jan 24 '19 at 20:23
  • You're right. I relied on my post from yesterday for the full source code needed for a minimal, complete and verifiable example, but this morning I inserted these four lines right after saving rcx into f1_ptr: xor r15,r15 / mov r15,rcx / mov rdx,qword[r15] / mov [f1_ptr],rdx. Without those lines, the memory location works correctly now. – RTC222 Jan 24 '19 at 20:47
  • The difference from yesterday was adding the lines mov r15,qword[f1_ptr] / mov rdx,qword[r15+0] to deref the pointer. I had the answer but then I overrode the answer, and didn't post the newest code because I thought it wasn't changed from yesterday. So it does work with the new line mov rdx,qword[r15] to deref the pointer to rdx. I'll edit my post above so others can see what the answer is. I'm not supposed to say thanks but I'll say it anyway. – RTC222 Jan 24 '19 at 20:47
  • You don't need to `xor`-zero a register before you do something else write-only to it. 32 and 64-bit `mov` don't have an output dependency, only a few rare cases like `lzcnt` or `popcnt` have false output-dependencies. [Why does breaking the "output dependency" of LZCNT matter?](https://stackoverflow.com/q/21390165) – Peter Cordes Jan 24 '19 at 20:54
  • I've gotten into the habit because in this code I'm moving bytes into al, so I zero out rax before the move (abundance of caution). Now I'm using movsx into eax, but I'm not sure about the rax high bytes. So I'm using xor liberally, when it may not be needed. – RTC222 Jan 24 '19 at 20:58
  • [Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?](//stackoverflow.com/q/11177137) explains why `movzx eax, byte [mem]` zero-extends into the full RAX. (Or `movsx` if you insist; usually it makes more sense to zero-extend character data, but I assume you have a use for it.) You could use `xor eax,eax` / `mov al, [mem]` but that would be strictly worse than `movzx eax, byte [mem]`. And `movzx rax, byte [mem]` would be a waste of a REX prefix. But `movsx rax, byte [mem]` would sign-extend into the full RAX. But usually just use 32-bit. (See the byte-load illustration after this comment thread.) – Peter Cordes Jan 24 '19 at 21:01
  • Thanks for that explanation, it makes my code more efficient. – RTC222 Jan 24 '19 at 21:03
  • Any particular reason for writing asm by hand here? I wouldn't be surprised if you can get a compiler to make asm that's more efficient than what you could write by hand (without significantly more experience). I see stuff in your other question like `movsd xmm0,qword[rbp+rcx] cvttsd2si rax,xmm0` that you could vectorize with `cvttpd2dq xmm1, [rbp+rcx]`, if that's aligned, and if the results fit in signed 32 bits. Or at least use `cvttsd2si rax, [rbp+rcx]` to fold the scalar load into a memory operand. – Peter Cordes Jan 24 '19 at 21:06
  • About a year ago I switched from 32-bit MASM to 64-bit NASM, and I'm writing by hand now so I can get a deeper understanding of it. For a long time I used high-level languages with spot speedups, like here, but low-level programming has been a special interest for many years. – RTC222 Jan 24 '19 at 21:11
  • These days it's often most useful to just tweak C / C++ source so it compiles more efficiently, if your compiler isn't terrible. See [C++ code for testing the Collatz conjecture faster than hand-written assembly - why?](//stackoverflow.com/a/40355466) for more about hand-holding the compiler into making better asm, and how to beat the compiler in that case. – Peter Cordes Jan 24 '19 at 21:29
  • Oh, I agree. Reading the source output from an optimizing compiler (e.g., GCC) is very useful. But remember, all compilers were initially written by hand (as far as I know)! – RTC222 Jan 24 '19 at 21:35
  • BTW, you might be interested in https://stackoverflow.com/questions/721090/what-is-the-difference-between-a-function-and-a-procedure. – RTC222 Jan 24 '19 at 21:38
  • Yeah of course. Compilers are complex pieces of machinery that apply a lot of collected knowledge of how to create efficient asm for modern CPUs. If you can do a better job than the compiler, then you can report missed-optimization bugs so humans can add code to look for the optimization you suggest, or fix whatever bug was causing it to miss an optimization. And sometimes tweak your C source so it compiles differently, often by making the C logic look like the optimal asm logic you want. The end result is some C that compiles efficiently now, and can be recompiled for future x86 CPUs. – Peter Cordes Jan 24 '19 at 21:44
  • This other [SO question](https://stackoverflow.com/questions/43163599/keyboard-interrupt-in-x86-protected-mode-causes-processor-error/43171871#43171871) isn't a duplicate of this but it is an indicator of what can happen if you don't set things up yourself. – Michael Petch Jan 24 '19 at 22:24
  • It's a long post and I'll read it all (I just saw your comment). I think what you mean is that it's an example of why it's good to know the system at the lowest level, which is my view. – RTC222 Jan 24 '19 at 23:15
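
Following up on the comments above about spilling to the stack instead of a static location: a minimal sketch, assuming the Windows x64 convention, where [rsp+8] is the shadow-space home slot for rcx at function entry. The function name is hypothetical, and the offset is only valid while rsp is unchanged inside the function:

get_second_char:                   ; hypothetical function; rcx = pointer to the array of string pointers
mov [rsp+8],rcx                    ; spill rcx to its shadow-space home slot instead of a static variable
; ... other work that may clobber rcx goes here ...
mov r10,qword[rsp+8]               ; later: reload the pointer from the stack; r10 is volatile, no save/restore needed
mov rdx,qword[r10+8]               ; rdx = second 8-byte entry = address of the second name string
movzx eax,BYTE[rdx+1]              ; eax = zero-extended second byte of that string
ret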
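
And on the zero- vs. sign-extension comments: a small illustration of the alternative ways to load one byte, assuming rdx already holds a valid string pointer:

movzx eax,BYTE[rdx+1]              ; preferred: zero-extends the byte; the 32-bit write also zeroes the upper half of rax
movsx eax,BYTE[rdx+1]              ; sign-extends into eax; the upper 32 bits of rax are still zeroed by the 32-bit write
movsx rax,BYTE[rdx+1]              ; sign-extends into all 64 bits of rax (costs a REX.W prefix)

xor eax,eax                        ; zero rax first...
mov al,BYTE[rdx+1]                 ; ...then load the byte: works, but it is two instructions and a partial-register write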

0 Answers