This is a guess, and it may well be wrong. It is based on the fact that the relative JMP and CALL instructions you encoded are 5 bytes long in 32-bit code but 3 bytes long in 16-bit code: 5 bytes - 3 bytes = 2 bytes. Since relative JMP and CALL targets are computed as a displacement from the start of the next instruction, that 2-byte difference offers a hint as to what might have gone wrong.
If I take this code:
```nasm
bits 32
org 0x1000

    cli
    xor eax,eax
    mov eax,0x1f1a
    mov esp,eax
    sti
    jmp 0x1010
    push 0x16b4
    call 0x14ca
    add esp,0x4
    jmp 0x101d
```
And assemble it with:
```
nasm -f bin stage2.asm -o stage2.bin
```
And review the 32-bit decoding with:
```
ndisasm -b32 -o 0x1000 stage2.bin
```
I get:
```
00001000  FA                cli
00001001  31C0              xor eax,eax
00001003  B81A1F0000        mov eax,0x1f1a
00001008  89C4              mov esp,eax
0000100A  FB                sti
0000100B  E900000000        jmp dword 0x1010
00001010  68B4160000        push dword 0x16b4
00001015  E8B0040000        call dword 0x14ca
0000101A  83C404            add esp,byte +0x4
0000101D  E9FBFFFFFF        jmp dword 0x101d
```
This looks correct. If, however, I decode the same bytes as 16-bit code with:
```
ndisasm -b16 -o 0x1000 stage2.bin
```
I get:
```
00001000  FA                cli
00001001  31C0              xor ax,ax
00001003  B81A1F            mov ax,0x1f1a
00001006  0000              add [bx+si],al
00001008  89C4              mov sp,ax
0000100A  FB                sti
0000100B  E90000            jmp word 0x100e
0000100E  0000              add [bx+si],al
00001010  68B416            push word 0x16b4
00001013  0000              add [bx+si],al
00001015  E8B004            call word 0x14c8
00001018  0000              add [bx+si],al
0000101A  83C404            add sp,byte +0x4
0000101D  E9FBFF            jmp word 0x101b
00001020  FF                db 0xff
00001021  FF                db 0xff
```
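The instructions are decoded incorrectly, but the JMPs and CALLs are still recognizable, and each one now targets an address 2 bytes lower than intended (0x100e instead of 0x1010, 0x14c8 instead of 0x14ca, 0x101b instead of 0x101d). The processor reads only a 16-bit displacement and treats the remaining 2 bytes as the start of the next instruction. This looks awfully like what you are observing.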
Without seeing your code, I can only hope that you have entered 32-bit protected mode by the time you start executing stage 2 at 0x1000. If you haven't, I suspect that is the root of your problems: I believe 32-bit encoded instructions are being executed in 16-bit real mode.
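For reference, here is a minimal sketch of what entering 32-bit protected mode before jumping to stage 2 could look like. It is not taken from your code: the labels `enter_pm`, `pm_entry`, `gdt` and `gdt_desc` are hypothetical, and it assumes the fragment lives in a first stage assembled at 0x7c00 with DS = 0, that the A20 line is already enabled, and that stage 2 has already been loaded at 0x1000.

```nasm
; Hypothetical stage 1 fragment (assumptions: org 0x7c00, DS = 0,
; A20 enabled, stage 2 already loaded at 0x1000).
bits 16
org 0x7c00

enter_pm:
    cli                         ; no protected mode IDT exists yet
    lgdt [gdt_desc]             ; load GDT base and limit
    mov eax, cr0
    or eax, 1                   ; set CR0.PE (protection enable)
    mov cr0, eax
    jmp 0x08:pm_entry           ; far jump loads CS with the 32-bit code selector

bits 32
pm_entry:
    mov ax, 0x10                ; flat data selector
    mov ds, ax
    mov es, ax
    mov ss, ax
    mov esp, 0x1f1a             ; same stack pointer value stage 2 loads
    jmp 0x1000                  ; stage 2 is now decoded as 32-bit code

align 8
gdt:
    dq 0                        ; mandatory null descriptor
    dq 0x00cf9a000000ffff       ; 0x08: 32-bit code, base 0, 4GiB limit
    dq 0x00cf92000000ffff       ; 0x10: 32-bit data, base 0, 4GiB limit
gdt_desc:
    dw gdt_desc - gdt - 1       ; GDT limit (size in bytes minus 1)
    dd gdt                      ; GDT linear base address
```

Note that the stage 2 shown earlier executes `sti`; in protected mode that is only safe once an IDT has been set up, so a real loader would need to handle that as well.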
Update
From the comments, the OP says they entered 32-bit protected mode only as part of the process of entering unreal mode. They believed that unreal mode would still decode instructions as 32-bit code, which is the source of the problem.
You get into unreal mode by entering 32-bit protected mode and then returning to 16-bit real mode. Unreal mode is still 16-bit real mode, with the exception that the limits in the hidden descriptor cache are set to 0xffffffff (4GiB limit). Once you are back in 16-bit real mode you can directly address memory beyond the first 64KiB of a segment using 32-bit addressing, but the code is still running in 16-bit real mode.
If you are writing code for 16-bit unreal mode, your compiler and assembler still need to generate 16-bit code. If you intend to write or generate 32-bit code, unreal mode isn't an option: you will need to enter 32-bit protected mode and stay there to execute it.
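As an illustration, a stage 2 intended for unreal mode could be assembled along these lines. This is only a sketch: the labels `stage2` and `.hang` and the 1MiB address are made up for the example, and it assumes DS has a base of 0, that its cached limit has already been raised to 4GiB, and that A20 is enabled.

```nasm
; Hypothetical unreal mode stage 2: 16-bit code, 32-bit addressing.
; Assumes DS base = 0, DS cached limit = 4GiB, A20 enabled.
bits 16
org 0x1000

stage2:
    cli
    mov sp, 0x1f1a              ; 16-bit stack pointer
    sti
    mov esi, 0x00100000         ; offset above 64KiB (1MiB, chosen for illustration)
    mov eax, [esi]              ; 32-bit addressing via the address-size prefix (0x67)
.hang:
    jmp .hang                   ; encoded as a 16-bit jump, so the target decodes correctly
```

The key difference from the original is the `bits 16` directive: NASM then emits 16-bit encodings by default and adds operand-size and address-size prefixes only where 32-bit registers or addresses are used.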