Turbo C / VGA x86 assembly: Copy from ram to vram

Question

I'm just having fun with turbo c to draw "sprites" on an 8086/286 (emulated with pcem) with an MCGA/VGA card.

Compiled with turbo c 3.0 it should work on real 8086 with MCGA. I'm not using the VGA mode x because it is a bit complex and I don't need extra vram for the things I want to do, even if there is some flickering on the screen, it's ok :).

In C, I have a bunch of memcpys moving data from the loaded sprite struct to the VGA in mode 13:

byte *VGA=(byte *)0xA0000000L;    
typedef struct tagSPRITE             
{
    word width;
    word height;
    byte *data;
} SPRITE;

void draw_sprite(SPRITE *sprite){
    int i = 0; int j = 0;
    word screen_offset = 0;   /* offset of the current scanline in VRAM */
    for(j=0;j<16;j++){
        memcpy(&VGA[screen_offset],&sprite->data[i],16);
        screen_offset+=320;
        i+=16;
    }
}

The goal is to convert that code to a specific assembly function to speed things just a bit.

(editor's note: this was the original asm attempt and text that an answer was based on. See the revision history to see what happened to this question. It was all removed in the last edit, making only the asker's own answer make sense, so this edit tries to make both answers make sense.)

I tried to write it in assembly with something like this, which I'm sure has huge mistakes:

void draw_sprite(SPRITE *sprite){
    asm{
        mov ax,0A000h
        mov es,ax           /* ES points to the video memory */

        mov di,0            /* ES + DI = destination video memory */
        mov si,[sprite.data]/* source memory ram ???*/
        mov cx,16           /* bytes to copy */

        rep movsb           /* move 16 bytes from ds:si to es:di (I think this is the same as memcpy)*/

        add di,320          /* next scanline in vram */         
        add si,16           /* next scanline of the sprite*/
        mov cx,16   

        rep movsb           /* memcpy */

        /*etc*/
    }
}

I know the ram address can't be stored in a 16 bit register because it is bigger than 64k, so mov si,[sprite.data] is not going to work.

So how do I pass the ram address to the si register (if it's possible)?

I know I have to use ds and si registers to set something like a "bank" in "ds", and then, the "si" register can read a 64k chunk of the ram, (so that movsb can move ds:si to es:di). But I just don't know how it works.

I also wonder if that asm code would be faster than the C code (on an 8 MHz 8086, or a 286), because you don't have to repeat the setup on every loop iteration.

I'm not copying from vram to vram for the moment, because I'd have to use the mode X and that's another story.

Passing arguments is dependent on the calling convention and (influenced by the memory model). Turbo-C by default uses CDECL calling convention. Parameters are passed right to left in the stack. In the large memory model code and data are FAR pointers so they are 32-bit values (segment and offset). Your assembly function will have to reference the stack so you use BP to reference stack values. So your function should push BP and then copy SP to BP. [BP+0] is the previous value of BP, [BP+2] is the return address (32-bit FAR pointer), [BP+6] is a FAR pointer to your sprite object. — Michael Petch, Sep 14 '18 at 23:54
To get access to sprite->data you need to get the FAR pointer from [BP+6] and get the address of sprite->data that will be offset by 4 bytes in the SPRITE data type. The push BP and copying SP to BP (and reverse at ret) should be automatically generated by the compiler for the draw_sprite function so you shouldn't have to manually do it in the ASM code block. You can use LDS/LES/LSS to load a 32-bit far pointer from memory. The segment will be placed in the segment the instruction is associated with (LDS=DS,LES=ES,LSS=SS). The offset will be placed in the register specified by the instruction. — Michael Petch, Sep 14 '18 at 23:55
Yes the compiler generates `les bx,[bp+6]` `les si,es:[bx+4]`. :). I assume this loads the pointer to `es:si` right?. If I try to load it to `ds:si` (for the movsw instruction) like this: `lds bx,[bp+6]` `lds si,ds:[bx+4]` The program freezes :) — Mills, Sep 15 '18 at 08:58
I'd have to see all the code you are using. But you will have to save and restore DS (push DS at start of ASM block, pop DS at end). Have to make sure CX has the right value. — Michael Petch, Sep 15 '18 at 10:07
**Don't use TurboC** in 2018 (it is obsolete and is not conforming to standards like C99 or C11). Use a better, more standard conforming, compiler such as [GCC](http://gcc.gnu.org/) — Basile Starynkevitch, Sep 15 '18 at 10:45
If interested by low-level stuff related to PC hardware, see [OSDEV](http://osdev.org/) — Basile Starynkevitch, Sep 15 '18 at 10:46
If you want more efficient code, use near pointers when possible. It looks like your compiler-generated code is using `les` or `lds` for normal pointers. Loading segment registers is expensive on modern x86, probably even in real mode. On real 8086 or 286, probably not super-expensive. — Peter Cordes, Sep 15 '18 at 10:55
@PeterCordes : He's using the **large** memory model. That is the downside -all pointers are considered FAR, all code is considered FAR. This code has no idea if the pointers are in the same data segment or not. So without seeing the context of the rest of the code it is hard to say how he's using segments or what data is where. — Michael Petch, Sep 15 '18 at 10:56
You don't push DS before BP, you push it after BP. move the `PUSH DS` after the `mov bp, sp` then do the `pop ds` before `mov sp,bp` — Michael Petch, Sep 15 '18 at 11:00
@MichaelPetch: Yeah, I figured that must be the case since the source didn't use the FAR keyword. My point was that manually deciding which pointers need to be FAR or not would probably be a speed and code-size optimization. (especially if far-call is / was expensive on 8086 or 286. It definitely is on modern x86.) — Peter Cordes, Sep 15 '18 at 11:00
And as I recall in the old Turbo-C CDECL calling convention SI and DI are non-volatile so like DS you need to save and restore their values too. It has been a lot of years but I could be wrong about that. You can change ES and not have to save/restore it because the ES segment register by default is considered volatile. — Michael Petch, Sep 15 '18 at 11:05
I'd recommend stepping through the code with Turbo Debugger to see what gets loaded into the segment registers, DI and SI. — Michael Petch, Sep 16 '18 at 12:06
Please don't edit the answer into the question. Now that you have a working answer, post it *as an answer* and leave the question as a question to avoid invalidating existing answers. (If you want to collect up some of MichaelPetch's comments about Turbo C's calling convention and *why* the working asm needs to do what it does, that would make for an even better answer.) — Peter Cordes, Sep 16 '18 at 20:50

Peter Cordes · Answer 1 · 2018-09-15T17:13:25.363

rep movsb increments SI and DI as well as decrementing CX. It's like a memcpy that takes its dst,src by reference and updates them to the end of the copied region.

So you need add di, 320-16, and si is already pointing to the next row of the sprite (because the row stride matches the width = 16).
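To make that pointer bookkeeping concrete, here is a minimal C model of `rep movsb` with DF=0. It is only a sketch: the `MovsRegs` struct and the function names are made up for illustration, and plain arrays stand in for the DS and ES segments.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the registers that `rep movsb` updates. */
typedef struct {
    uint16_t si;  /* source offset within DS */
    uint16_t di;  /* destination offset within ES */
    uint16_t cx;  /* bytes still to copy */
} MovsRegs;

/* Model of `rep movsb` with DF=0: copies CX bytes and leaves SI and DI
   pointing just past the copied region, with CX = 0. */
static void rep_movsb(MovsRegs *r, const uint8_t *ds, uint8_t *es)
{
    while (r->cx != 0) {
        es[r->di] = ds[r->si];
        r->si++;
        r->di++;
        r->cx--;
    }
}

/* Copy a 16x16 sprite to the top-left corner of a 320-byte-wide screen. */
static void draw_16x16(const uint8_t *sprite, uint8_t *screen)
{
    MovsRegs r = {0, 0, 0};
    int row;
    for (row = 0; row < 16; row++) {
        r.cx = 16;
        rep_movsb(&r, sprite, screen);
        r.di += 320 - 16;  /* DI already moved 16 bytes to the right */
        /* SI needs no adjustment: sprite rows are contiguous */
        assert(r.di == (uint16_t)((row + 1) * 320));
    }
}
```

The `assert` checks the invariant the text relies on: after adding `320-16`, DI sits at the start of the next scanline.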

As far as segmentation, movsb copies from DS:SI to ES:DI, so setting up ES:DI to point at video memory is correct.
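As a rough illustration of what ES:DI means in real mode, the physical address is the segment value times 16 plus the offset. The helper names and the `SCREEN_*` constants below are assumptions for this sketch, not anything Turbo C provides.

```c
#include <assert.h>
#include <stdint.h>

#define VGA_SEGMENT   0xA000u  /* segment of mode 13h video memory */
#define SCREEN_WIDTH  320u
#define SCREEN_HEIGHT 200u

/* Real-mode translation: physical address = segment * 16 + offset. */
static uint32_t linear_address(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Offset (the value you would put in DI) of pixel (x, y) in mode 13h. */
static uint16_t mode13_offset(unsigned x, unsigned y)
{
    assert(x < SCREEN_WIDTH && y < SCREEN_HEIGHT);
    return (uint16_t)(y * SCREEN_WIDTH + x);
}
```

For example, `linear_address(VGA_SEGMENT, 0)` gives `0xA0000`, the first byte of the mode 13h framebuffer, and the whole 320x200 screen (64000 bytes) fits in a single 64 KiB segment, so only DI has to change while drawing.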

Turbo C's calling convention requires/ensures DF=0 on function entry/exit (like normal 32-bit calling conventions), so you don't need a cld to make sure movsb goes in the right direction (forward instead of backward). (If you used std somewhere else and didn't put it back, fix it there to avoid violating the calling convention.)

Turbo C's calling convention also has call-clobbered AX/BX/CX/DX, and ES. (Thanks @MichaelPetch). If its inline asm is anything like MSVC, the compiler will save/restore DI and SI for you. But possibly it doesn't save/restore DS for you, so @MichaelPetch suggests you'll need to push/pop DS to save/restore it yourself. Have a look at the compiler-generated asm to make sure you're following the calling convention.

From your updated question, we can see your build options include memory model = large that makes all pointers into far-pointers, which will be a significant slowdown vs. manually choosing which pointers need to be FAR and others being only 16 bit. But if you don't have any reason to learn about 16-bit real-mode segmentation and all that no-longer-relevant stuff, then sure keep using that. (You might choose a memory model where at least code can be near, so near call/ret only push/pop an IP value, not also a CS.)

You can put the code in a loop, like this.

I have a mix of hard-coding width / height vs. loading it, like your question, but if you calc the row stride in BX (320-width), you have enough registers to hoist the calculations out. The loop branch itself already handles runtime-variable sprite sizes, too.

    push  ds

    mov   ax,0A000h
    mov   es,ax             // ES = video memory segment, as in your question
    xor   di,di             // DI=0

    //mov   si,[sprite.data]  /* source memory ram ???*/
    lds   si,[sprite.data]  // with your build options, everything is a seg:off FAR pointer
    lea   ax, [si + 16*16]  // end_src pointer

    mov   dx, [sprite.width]
    shr   dx, 1              // words to copy = bytes / 2
    // if you can't assume even width, then just use movsb
    // or optimize with rep movsw + a test of the low bit for one movsb

@loop:                    // do {
    mov   cx,dx            /* words to copy */

    rep movsw             /* copy 16 bytes from ds:si to es:di */

    add   di, 320-16      /* starting column in next scanline in vram */         
    // add si, 0          // sprite row stride - width = 0

    cmp   si, ax
    jb   @loop           // } while(src < endsrc);

    pop   ds

Note the use of movsw to copy in 2-byte chunks. x86 before PPro really did just copy 1 byte or 1 word at a time, according to the operand size.

PPro and later have fast-strings microcode that copies in larger chunks. But this has significant startup overhead so for only 16 bytes it would be best on a modern x86 in 16-bit mode to use maybe 4 DWORD integer registers (eax), or qword with x87 fild qword/fistp, or 16-byte with one XMM register.

On an actual 8086 or 286, fild/fistp would be horribly slow compared to integer copies. With a 16-bit data bus you can only copy 2 bytes at a time anyway so rep movsw is good on a real 286.
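A C sketch of the same idea, for reference: copy each row as 16-bit words, with one extra byte copy when the width is odd, and step the destination by the screen width. The names `copy_row` and `blit` are hypothetical, and a plain buffer stands in for video memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SCREEN_WIDTH 320u

/* Copy one row as width/2 word moves (like rep movsw), plus one byte
   move if the width is odd (the low bit shifted out by shr). */
static void copy_row(uint8_t *dst, const uint8_t *src, unsigned width)
{
    unsigned words = width >> 1;
    unsigned i;
    for (i = 0; i < words; i++) {
        memcpy(dst + 2 * i, src + 2 * i, 2);  /* one 16-bit move */
    }
    if (width & 1u) {
        dst[width - 1] = src[width - 1];      /* trailing odd byte */
    }
}

/* Copy a width x height sprite to the top-left corner of the screen. */
static void blit(uint8_t *screen, const uint8_t *data,
                 unsigned width, unsigned height)
{
    unsigned row;
    assert(width <= SCREEN_WIDTH);
    for (row = 0; row < height; row++) {
        copy_row(screen, data, width);
        data += width;           /* sprite rows are contiguous */
        screen += SCREEN_WIDTH;  /* same net effect as add di, 320-width */
    }
}
```

On a modern compiler the two-byte `memcpy` typically becomes a single 16-bit load and store, which is the closest C gets to one `movsw` step.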

See also "What setup does REP do?"

And "Enhanced REP MOVSB for memcpy" for memcpy on modern x86 (mostly focused on large copies, though).

Also note that VRAM is typically uncacheable or write-combining, so if you're actually optimizing a copy-to-VRAM routine, multiple narrow stores to the same cache line suck for UC, but are not bad for WC, on a CPU with cache.

Thanks. I forgot to say, I'm compiling using large model (-ml) and the line "mov si,[spr.data]" is not working. — Mills, Sep 14 '18 at 18:29
@Mills: Maybe try `mov si, sprite.data` without the square brackets. I don't know Turbo-C or MSVC inline asm that well. You could write the whole function in asm and load the args yourself (or better declare it with a register-args calling convention so you get a pointer to the struct in a register). — Peter Cordes, Sep 14 '18 at 18:37
The compiler does not complain if I use `lds si,ds:[spr.data]`. That is supposed to load the address to ds:si. But it just draws black pixels. — Mills, Sep 14 '18 at 20:31
@Mills: But your pointer is just a 16-bit pointer, not a far pointer, so loading seg:offset into ds:si with `lds` would load garbage into `ds`. — Peter Cordes, Sep 14 '18 at 20:36
@Mills: Use a debugger to see how it compiled and what values are in registers. Single-step into your function. Maybe check that `DF=0`. Most calling conventions require DF=0, but I don't know about 16-bit. — Peter Cordes, Sep 14 '18 at 20:42
Thanks, I'll have to read a bit more how it loads the address by default. Converting the C to assembly the compiler writes this `les bx,dword ptr [bp+6]` and `les bx,dword ptr es:[bx+4]`. I'll keep reading, now at least I understand a bit what it's doing :) — Mills, Sep 14 '18 at 20:42
In Turbo-C's CDECL calling convention DF is assumed to be 0 upon entry to a function and on exit. So if your code changes it to 1 (set) then you have to ensure it is cleared before the function returns. AX BX CX DX ES and the floating point registers are volatile everything else is non-volatile by default (this behaviour can be overridden by compiler options) — Michael Petch, Sep 15 '18 at 11:24

score 1 · Accepted Answer · answered Sep 16 '18 at 22:09

Thanks to Michael Petch, Peter Cordes, and everybody. I got the answer.

The assembly code to copy data to the vga video memory looks like this:

DGROUP          GROUP    _DATA, _BSS
_DATA           SEGMENT WORD PUBLIC 'DATA'
_DATA           ENDS
_BSS            SEGMENT   WORD PUBLIC 'BSS'             
_BSS            ENDS
_TEXT           SEGMENT BYTE PUBLIC 'CODE'
                ASSUME CS:_TEXT,DS:DGROUP,SS:DGROUP

            PUBLIC _draw_sprite       
_draw_sprite    proc    far 
    push bp
    mov bp,sp
    push ds
    push si
    push di
    ;-----------------------------------
    lds     bx,[bp+6]
    lds     si,ds:[bx+4]        ; sprite->data to ds:si
    mov     ax,0A000h
    mov     es,ax                       
    mov     di,0                ; VGA[0] to es:di

    mov     ax,16               ; 16 scan lines
copy_line:  
    mov     cx,8
    rep     movsw               ; copy 16 bytes from ds:si to es:di
    add     di,320-16           ; go to next line of the screen
    dec     ax
    jnz     copy_line
    ;-----------------------------------
    pop di
    pop si
    pop ds
    mov sp,bp
    pop bp
    ret 
_draw_sprite    endp

Declare the function in C as:

    void draw_sprite(SPRITE *spr);

The data stored at spr->data is an array of bytes (values from 0 to 255), each holding the color of one pixel.

That code finally draws the 16x16 bitmap at position x = 0, y = 0.
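For anyone wondering where the `+4` in `lds si,ds:[bx+4]` comes from: as far as I understand it, in the large model the two `word` fields take 2 bytes each and the `data` pointer is a 4-byte far pointer stored offset first, then segment. The sketch below models that layout in portable C; `FarPtr`, `SpriteLayout` and `load_data_pointer` are made-up names for illustration, not Turbo C types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a far pointer as stored in memory:
   offset word first, then segment word (the order LDS/LES read). */
typedef struct {
    uint16_t offset;
    uint16_t segment;
} FarPtr;

/* Stand-in for SPRITE as a 16-bit large-model compiler would lay it out. */
typedef struct {
    uint16_t width;   /* bytes 0-1 */
    uint16_t height;  /* bytes 2-3 */
    FarPtr   data;    /* bytes 4-7 */
} SpriteLayout;

/* Model of `lds si,[bx+4]`: read the little-endian far pointer stored
   4 bytes into the struct (offset -> SI, segment -> DS). */
static FarPtr load_data_pointer(const uint8_t *sprite_bytes)
{
    FarPtr p;
    assert(sprite_bytes != NULL);
    p.offset  = (uint16_t)(sprite_bytes[4] | (sprite_bytes[5] << 8));
    p.segment = (uint16_t)(sprite_bytes[6] | (sprite_bytes[7] << 8));
    return p;
}
```

Here `offsetof(SpriteLayout, data)` is 4, which matches the displacement used in the assembly listing.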

Thanks a lot!
