Creating a for loop to fill an empty buffer with dashes in assembly

Question

This is what I have so far. Every time execution reaches the loop, I get `Segmentation fault (core dumped)`. Is this because some of my registers are holding values of the wrong size?

 .data
   Welcome:
        .ascii "Welcome to League of Legends!!.\n\0"
  Instruction:
        .ascii "Player 1, enter a Champion's name: \0"
  Text:
        .space 12
  Text2:
        .space 12

  Guess:
        .ascii "Guess a letter: \0"
  Letter:
        .space 1
  SecretCharacter:
        .ascii "Your Champion is: \0"

  .text

  .global _start

  _start:
         mov $Welcome, %rax
        call PrintCString

        mov $Instruction, %rax
        call PrintCString
        mov $Text, %rax
        mov $12, %rbx
        call ScanCString
        mov %rax, %rbx
        mov %rax, %rbp
        call LengthCString
        mov %rax, %rcx


        mov $0, %rdi
        mov $45, %ch

Loop:
        cmp %rcx, %rdi
        jge End

        mov $Text2, %eax
        movb %ch, (%rax, %rdi)
        add $1, %rdi
        jmp Loop

End:
        call PrintCString
        call EndProgram

`mov $Text2, %eax` should be `mov $Text2, %rax`. Also, it should be above the loop instead of inside it. — prl, Dec 09 '17 at 05:01
@prl: Yes it should be outside the loop, but on Linux static addresses are guaranteed to be in the low 32 bits of address space, so it saves code size to use 5-byte zero-extended `mov r32, imm32` instead of 7-byte `mov r/m64, sign-extended-imm32`. (GAS doesn't make that optimization for you, even for numeric constants like `$1`, let alone link-time constants). For a position-independent executable (or OS X where static addresses don't fit in 32-bits even though PIC isn't required), you would use `lea Text2(%rip), %rax`. — Peter Cordes, Dec 09 '17 at 06:25
Well, then, in that case `mov $Welcome, %rax` and a few others are wrong. I knew it was one or the other. — prl, Dec 09 '17 at 07:02
@prl: Well, inefficient anyway. But that and `mov $0, %rdi` isn't "wrong" for a beginner that doesn't care about efficiency. [`xor %edi,%edi` is obviously better](https://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and), though. — Peter Cordes, Dec 09 '17 at 07:21

score 1 · Accepted Answer · answered Dec 09 '17 at 04:50

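You're using both `rcx` and `ch` at the same time, but `ch` is part of `rcx`: it is bits 8-15 of that register. `mov $45, %ch` therefore overwrites part of the loop count, so for a short string `rcx` becomes 45 × 256 plus the length (more than 11,000) instead of the length. The loop then keeps storing well past the end of the 12-byte `Text2` buffer until it touches unmapped memory and faults. Try using `dh` instead of `ch`.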
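For illustration, here is a minimal standalone sketch of the loop with the dash kept in a register that does not overlap `rcx`. It is not the asker's full program: the hard-coded length of 5 stands in for the value `LengthCString` would return, and the raw Linux `write` and `exit` system calls replace `PrintCString` and `EndProgram`, whose implementations are not shown in the question. It uses `%dl` rather than `%dh` and loads the buffer address once, outside the loop, as the comments suggest.

```asm
# fill.s: standalone sketch for x86-64 Linux, GAS AT&T syntax.
# Build and run: as fill.s -o fill.o && ld fill.o -o fill && ./fill
        .data
Text2:
        .space 12                   # output buffer, zero-filled

        .text
        .global _start
_start:
        mov     $5, %ecx            # assumed length; stands in for LengthCString's result
        lea     Text2(%rip), %rax   # buffer address, loaded once outside the loop
        xor     %edi, %edi          # index = 0 (writing %edi zeroes all of %rdi)
        mov     $45, %dl            # 45 is ASCII '-'; %dl is part of %rdx, not %rcx

Loop:
        cmp     %rcx, %rdi          # compare index with length
        jge     End                 # stop once index >= length
        movb    %dl, (%rax, %rdi)   # Text2[index] = '-'
        add     $1, %rdi
        jmp     Loop

End:
        movb    $10, (%rax, %rdi)   # append a newline after the dashes
        lea     1(%rdi), %rdx       # byte count = dashes + newline
        mov     %rax, %rsi          # buffer pointer for write
        mov     $1, %edi            # file descriptor 1 (stdout)
        mov     $1, %eax            # syscall number 1 = write
        syscall

        mov     $60, %eax           # syscall number 60 = exit
        xor     %edi, %edi          # exit status 0
        syscall
```

With the assumed length of 5, this writes five dashes and a newline. Because the count in `rcx` is never disturbed, the stores stay inside the 12-byte buffer for any length up to 11.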

answered Dec 09 '17 at 04:50

prl


@Bobbin: Or better, `%dl` avoids [extra latency from reading high-8 registers](https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to). As a rule, only use AH/BH/CH/DH to get at the bytes of a larger value after loading the full register. (e.g. `mov (%rdi), %eax` ; `movzbl %al, %ecx` ; `movzbl %ah, %edx` ; `shr $16, %eax` ; ...) – Peter Cordes Dec 09 '17 at 06:21
@PeterCordes Wasn't it fine to use ah, bh, ch, and dh when you consistently use the low and high parts separately? – fuz Dec 09 '17 at 11:38
@fuz: On CPUs which do rename at least the high partial registers separately, yes. (On AMD, and on Silvermont / KNL, writing `ah` / `al` both have a false dep on rax (and thus each other)). On HSW/SKL, dirtying DH presumably consumes an extra physical register file entry. I think while it's dirty, there's no latency penalty for reading it repeatedly. But if an interrupt handler saves/restores it, the interrupt handler will suffer the penalty of the merge uop having to issue in a cycle by itself, and then after it's merged reading DH has an extra 1c of latency. Tiny but strictly worse than DL – Peter Cordes Dec 09 '17 at 12:16
@PeterCordes well as [this answer](https://stackoverflow.com/q/45660139/149138) shows it's not like the low registers are strictly better than the high registers: since they are renamed separately you avoid the continual false dependency when overwritten (e.g., `mov` to `ah` can run at 4/cycle, but `mov` to `al` only 1 per cycle since it is falsely depending on the high bits of `eax`), although that doesn't matter here. – BeeOnRope Dec 09 '17 at 16:20
@BeeOnRope: yes, with `mov ah, r/m8`, but not with `mov ah, imm8` or with `setcc ah`. But if you're writing portably-performant code, you have to be prepared for Ryzen or whatever where writing AH always has a false dep on RAX. So yes, the low8 regs aren't always better than or equal to high8, but in general I'd definitely recommend not writing to high8 registers unless you have a reason. – Peter Cordes Dec 09 '17 at 16:25
@PeterCordes Well being "prepared" for Ryzen doesn't mean you have to change your code: you could weigh it based on the market share of various CPUs in your target market. That mostly means that Sandy Bridge and derivatives dominate, so it would have to be really bad on Ryzen to change, right? The false dependency is a pretty big thing, not really because it limits the `mov` to 1/cycle really, but because it might link two otherwise unrelated dependency chains. You have to be careful _both_ when using high _and_ low parts as they have different issues. – BeeOnRope Dec 09 '17 at 16:27
@BeeOnRope: If it couples something into a long loop-carried dep chain, then that's probably "really bad". And if not then it doesn't matter, so the dep-breaking of writing a high8 reg shouldn't matter. (The merge is only 1c latency). IDK, maybe there are use-cases. Don't write high8 regs in high-performance code unless you've looked at how it will perform on SnB-family, Nehalem, and Ryzen, and usually avoid it in general; you're usually not missing out on anything great. – Peter Cordes Dec 09 '17 at 16:31
It doesn't come up too much in practice since a reasonable approach is to use neither `ah` nor `al` but simply `eax` with an initial `movzx` or whatever, which avoids most of the downsides and works for "most" operations. If you didn't have that option though I'd say, based on your investigation, that `ah` behaves _more_ like the independent byte register than `al`. It doesn't have to be a big loop carried dependency chain: _any_ longer dependency chain often hurts scheduling. – BeeOnRope Dec 09 '17 at 16:34
