Remove extra spaces from a string assembly

Question

I got this question on an assembly exam. Given the string:

str db "  hello world   hello    world   #"

The # sign marks the end of the string. After processing, the string should look like this:

"hello world hello world#"

Any algorithm or advice for removing extra white space would be appreciated.

I tried this code:

data segment
str db "  hello world   hello    world   #"
data ends


start:
mov si,offset str
mov di,0

while:
cmp [si],'#'
jne loopwhile
jmp whileT
loopwhile:
inc di
inc si
jmp while

whileT:
mov si,0
while2:
cmp si,di
jae finish

   cmp str[si],32
   je check2
   inc si
   jmp while2
   check2:
   cmp str[si+1],32
   je inner
   inc si 
   jmp while2
      inner:
        mov bx,si
        inc bx
        innerW:
        cmp bx,di
        jae finishInner
        mov al,str[bx+1]
        mov str[bx],al
        inc bx
        jmp innerW 
        finishInner:
        dec di
        jmp while2

finish: 
mov ax,4Ch
int 21h 

code ends

But I still get one extra space at the beginning of the string.

Why should the output contain a space between the first `world` and the second `hello`, but no spaces before the first `hello` or between `world` and `#`? I find the requirements for the program unclear. — Michael, Nov 10 '16 at 07:04
I'm sorry if it's not clear. It is supposed to remove all extra whitespace; in other words, the output string should look like this: "hello world hello world#" — 2Stoned, Nov 10 '16 at 07:18
Well, your code is designed to always leave one space behind. Apparently you want to remove all spaces in some places (at the start/end of the string?), so you'll have to adjust your code to handle that. — Michael, Nov 10 '16 at 07:30
I tried this solution http://stackoverflow.com/a/38532219/6389176.... — 2Stoned, Nov 10 '16 at 07:32
Like I said, that algorithm always leaves one space behind. If you have different requirements then you're going to have to come up with a different algorithm. — Michael, Nov 10 '16 at 07:44
Yes yes thank you, added another loop to check for space at the beginning. — 2Stoned, Nov 10 '16 at 07:47

Ped7g · Accepted Answer · 2016-11-11T11:24:45.397

A simpler (?) (certainly shorter) algorithm:

    mov   ax,SEG str
    mov   ds,ax
    mov   es,ax
    mov   si,OFFSET str
    mov   di,si
    mov   bx,si
    ; ds:si = source pointer to read char by char
    ; es:di = destination pointer to write modified string
    ; bx = str pointer for compare during second phase
    xor   cx,cx  ; cx = 0, counts spaces to copy
    cld          ; direction flag clear, so lodsb/stosb increment si/di

copyLoop:
    lodsb           ; al = ds:[si++]
    cmp   al,'#'
    je    removeTrailingSpaces
    cmp   al,' '
    jne   notSpace
    jcxz  copyLoop  ; no more spaces allowed to copy, skip
    ; copy the space
    dec   cx        ; --allowed
    stosb           ; es:[di++] = al
    jmp   copyLoop

notSpace:
    mov   cx,1      ; one space can be copied next time
    stosb           ; copy the not-space char
    jmp   copyLoop

removeTrailingSpaces:
    cmp   di,bx
    je    emptyStringResult
    dec   di
    cmp   BYTE PTR [di],' '
    je    removeTrailingSpaces
    inc   di        ; not-space found, write '#' after it
emptyStringResult:
    stosb           ; write the '#' at final position

    mov   ax,4C00h  ; exit: ah = 4Ch (terminate), al = return code 0
    int   21h

How it works:

It copies almost everything from ds:[si] to es:[di], counting down the spaces it is still allowed to copy and skipping any further spaces once the counter reaches zero. A non-space character resets the counter to 1 (so the next space after a word will be copied).

When '#' is found, it scans backwards from the end of the output for trailing spaces and writes the terminating '#' right after the last non-space character (or at the very start, if the result is empty).
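
For anyone who finds the register juggling hard to follow, here is a rough C model of the same loop. It is only a sketch, not part of the original answer: the function name `squeeze_spaces` is made up, `src`, `dst`, `start` and `allowed` stand in for si, di, bx and cx, and the NUL written after '#' is a C convenience that the assembly version does not have.

```c
/* Hypothetical C model of the copy loop above (in-place, '#'-terminated). */
static void squeeze_spaces(char *str)
{
    char *src = str;     /* read pointer, plays the role of si */
    char *dst = str;     /* write pointer, plays the role of di */
    char *start = str;   /* start of string, plays the role of bx */
    int allowed = 0;     /* spaces that may still be copied (cx) */
    char c;

    while ((c = *src++) != '#') {
        if (c == ' ') {
            if (allowed == 0)
                continue;        /* no budget left: skip this space */
            allowed--;           /* spend the budget and copy the space */
            *dst++ = c;
        } else {
            allowed = 1;         /* one space may follow this character */
            *dst++ = c;
        }
    }

    /* Step back over trailing spaces, but never past the start. */
    while (dst != start && dst[-1] == ' ')
        dst--;

    dst[0] = '#';                /* terminator right after last non-space */
    dst[1] = '\0';               /* assumption: NUL added only for C printing */
}
```

Calling it on a writable buffer that holds the string from the question should leave "hello world hello world#" in that buffer; inputs such as "#" or "   #" reduce to just "#".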

Discussing in the comments how I built this algorithm, and how it cannot tell whether the current word is the last one, gave me another idea for handling the end of the string: cache the position of the last known word end, so that on reaching the end of the source string I can use the cached pointer to put the terminator directly in the right place. Variant 2:

    ; initial code is identical, only the role of bx differs, so here is the updated comment:
    ...
    ; bx = str pointer pointing +1 beyond last non-space character
    ; (for empty input string that means OFFSET str to produce "#" result)
    ...

copyLoop:
    lodsb           ; al = ds:[si++]
    cmp   al,'#'
    je    setTerminatorAndExit
    cmp   al,' '
    jne   notSpace
    jcxz  copyLoop  ; no more spaces allowed to copy, skip
    ; copy the space
    dec   cx        ; --allowed
    stosb           ; es:[di++] = al
    jmp   copyLoop

notSpace:
    mov   cx,1      ; one space can be copied next time
    stosb           ; copy the not-space char
    mov   bx,di     ; update bx to point +1 beyond last non-space char
    jmp   copyLoop

setTerminatorAndExit:
    mov   [bx],al   ; write the '#' to cached position of last non-space+1

    mov   ax,4C00h  ; exit: ah = 4Ch (terminate), al = return code 0
    int   21h
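
The same idea can be modeled in C, again purely as an illustrative sketch: `squeeze_spaces_cached` is a made-up name, `end` plays the role of bx in this variant, and the trailing NUL is an assumption added so the result can be printed from C.

```c
/* Hypothetical C model of variant 2: remember where the last word ended. */
static void squeeze_spaces_cached(char *str)
{
    char *src = str;     /* read pointer (si) */
    char *dst = str;     /* write pointer (di) */
    char *end = str;     /* one past the last non-space char written (bx) */
    int allowed = 0;     /* spaces that may still be copied (cx) */
    char c;

    while ((c = *src++) != '#') {
        if (c == ' ') {
            if (allowed == 0)
                continue;        /* skip extra space */
            allowed--;
            *dst++ = c;
        } else {
            allowed = 1;
            *dst++ = c;
            end = dst;           /* cache the new end-of-word position */
        }
    }

    end[0] = '#';                /* no backward scan needed */
    end[1] = '\0';               /* assumption: NUL added only for C printing */
}
```

Compared with the first version, the backward scan disappears, at the cost of one extra pointer update per non-space character.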

@2Stoned: of course it is? The question is, whether you understand how it works and if it will give you some new ideas for your own programming. In case you have some question about the code, write it here. — Ped7g, Nov 11 '16 at 10:29
Yep, got it (cx allows one space, and the trailing-space removal cleans up the space before #). Thank you, very nice solution! — 2Stoned, Nov 11 '16 at 10:52
@2Stoned Yes, and by starting with cx=0 it removes all spaces ahead of the first word (or all of them, if the string consists purely of space characters), so it doesn't need an extra loop at the beginning; the logic of the main loop covers that case. If you knew for sure that there is at least one non-space character in the string (never an empty one), the `bx` test could be removed completely (the non-space character would always stop the trailing loop), but the current version, which compares pointers to detect the start of the string, is the best I could think of that also handles strings like `"#"` and `" #"`. — Ped7g, Nov 11 '16 at 10:57
@2Stoned The basic idea I built it on was "a streaming flow of data, filtering out the unwanted"... Unfortunately it's impossible to decide whether a space encountered after a word belongs to the trailing ones ahead of # (you can't tell whether the word is the last one), so it's not pure stream processing; it has to backtrack at the end. On the bonus side: if you set `cx` to `n` on a non-space char, it would keep "at most n" spaces between words (if the source string has that many). — Ped7g, Nov 11 '16 at 11:07
@2Stoned Check the edit of the answer :) (and that's the last thing I have to add to it, hopefully, so no more notifications from me :P ). — Ped7g, Nov 11 '16 at 11:25
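
The "at most n spaces" remark in the comments above can be tried out with a small C sketch. The name `squeeze_spaces_n` and the parameter `n` are hypothetical, and the code reuses the cached end pointer from variant 2 rather than the backward scan, which is a choice made here for brevity.

```c
/* Hypothetical C sketch: keep at most n spaces between words. */
static void squeeze_spaces_n(char *str, int n)
{
    char *src = str;     /* read pointer (si) */
    char *dst = str;     /* write pointer (di) */
    char *end = str;     /* one past the last non-space char written (bx) */
    int allowed = 0;     /* remaining space budget (cx) */
    char c;

    while ((c = *src++) != '#') {
        if (c == ' ') {
            if (allowed == 0)
                continue;        /* budget used up: drop the space */
            allowed--;
            *dst++ = c;
        } else {
            allowed = n;         /* up to n spaces may follow this char */
            *dst++ = c;
            end = dst;
        }
    }

    end[0] = '#';                /* also drops any trailing spaces copied */
    end[1] = '\0';               /* assumption: NUL added only for C printing */
}
```

With n = 1 this behaves like the answer's code; with n = 2, an input of "a    b  c #" should come out as "a  b  c#".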
