How do you determine frequency of words in a user entered string in Assembly Language X86?

Question

I am a complete beginner to assembly language programming. I need help in writing an assembly language program to get a string from the user, count and display the number of times each word occur in the user entered string.

For example if the user types in:

Hello Hello what is new Hello what is not new

Output should be:

Hello   3
what    2
is      2
not     1
new     2

I have the following code below that will give me the frequency of characters within a string. However, I do not how to edit so that I can keep track of words instead of just characters, then be able to display those words with their corresponding frequency.

INCLUDE Irvine32.inc

Get_frequencies PROTO,
    pString:PTR BYTE,   ; points to string
    pTable:PTR DWORD    ; points to frequency table

.data
freqTable DWORD 256 DUP(0)
;aString BYTE 1,2,"This is extremely difficult for the experienced",0

aString BYTE 80 DUP(0),0

str1 BYTE "*** Constructing a Frequency Table *** (DEMO)",
    0dh,0ah,0dh,0ah,
    "Enter between 1 and 80 characters: ",0

.code
main PROC

    call Clrscr
    mov  edx,OFFSET str1
    call WriteString

    mov  ecx,SIZEOF aString - 1
    mov  edx,OFFSET aString
    call ReadString

    INVOKE Get_frequencies, ADDR aString, ADDR freqTable
    call DisplayTable

   exit
   main ENDP

;-------------------------------------------------------------

  Get_frequencies PROC,
    pString:PTR BYTE,   ; points to string
    pTable:PTR DWORD    ; points to frequencey table

;
; Constructs a character frequency table. Each array position
; is indexed by its corresponding ASCII code.
;
; Returns: Each entry in the table contains a count of how
; many times that character occurred in the string.
;-------------------------------------------------------------

mov esi,pString
mov edi,pTable
cld     ; clear Direction flag (forward)

L1: mov eax,0       ; clear upper bits of EAX
   lodsb        ; AL = [ESI], inc ESI
   cmp al,0     ; end of string?
   je  Exit_proc        ; yes: exit
   shl eax,2        ; multiply by 4
   inc DWORD PTR [edi + eax]    ; inc table[AL]
   jmp L1       ; repeat loop

 Exit_proc:
    ret
 Get_frequencies ENDP

  ;-------------------------------------------------------------

 DisplayTable PROC

  ;
  ; Display the non-empty entries of the frequency table.
  ; This procedure was not required, but it makes it easier
  ; to demonstrate that Get_frequencies works.
  ;-------------------------------------------------------------

  .data
  colonStr BYTE ": ",0
  .code
  call Crlf
  mov ecx,LENGTHOF freqTable    ; entries to show
  mov esi,OFFSET freqTable
  mov ebx,0 ; index counter

 L1:    mov eax,[esi]   ; get frequency count
        cmp eax,0   ; count = 0?
        jna L2  ; if so, skip to next entry

         mov eax,ebx    ; display the index
         call WriteChar
         mov edx,OFFSET colonStr    ; display ": "
         call WriteString
         mov eax,[esi]  ; show frequency count
         call WriteDec
         call Crlf

  L2:   add esi,TYPE freqTable  ; point to next table entry
        inc ebx ; increment index
        loop L1

        call Crlf
        ret
      DisplayTable ENDP

      END main

This is what I have so far in my attempt to implement the brute-force search suggested in Peter's answer:

 .data
 str2 BYTE "one two three",0

 .code
  main proc
   mov  edi,OFFSET str2
    Mov esi,edi
    Mov Ecx, 0  ;reset ecx to 0
    Not Ecx     ;set Ecx to -1 or highest possible integer
    Mov Al, ' ' ;Initialize a1 to delimiter of (space) ' '
    Cld         ;Clear Direction Pointer
    Repne Scasb ;scan edi one byte at a time until delimiter found
    Not Ecx
    Lea Eax, [ecx-1] ;Set Eax to index of found delimiter
    Xchg Esi, Edi  ;Take Edi which is now equal to string after found       delimiter and put in esi

    mov edx, esi
    call WriteString    

 main endp
 end main

This prints "two three", but I want it to print "one".

Don't forget to look at what you post, and edit it if it doesn't look right. I had to re-edit your question to separate the new code from the previous code-dump. — Peter Cordes, May 21 '16 at 18:00
You're passing the *end* of the word you found to `WriteString`. That's why it prints the rest of the string. That tells you that your code worked, and did stop after the first word. What args does `WriteString` take? I've never used Irvine32. If it takes a pointer and length, you should be able to use it. Otherwise, if it just prints up to a 0-terminator, you can't use it that way. (You could overwrite the `' '` with a `'\0'` terminator, but that only matters for debug-printing.) — Peter Cordes, May 21 '16 at 18:05
Why don't you just `mov ecx, -1` like a normal person? You'd only use `not ecx` or `dec ecx` to save bytes in generating a -1 by doing `xor ecx,ecx / dec ecx` (3B total in 32bit code, 4B in 64bit code) instead of `mov ecx, -1` (5B). BTW, I checked, and `WriteString` just prints the null-terminated string passed in `edx`. I didn't know why you were calculating something in `eax`. BTW, the useful thing to save is the length of the word for future loops or `rep cmpsb`. (e.g. `mov eax, ecx` / `neg eax`, unless I have an off-by-one error.) The index of the delimiter isn't why it's interesting. — Peter Cordes, May 21 '16 at 18:20

Peter Cordes · Answer 1 · 2021-01-09T06:25:51.593

There's nothing special about asm that would make you choose a different algorithm than if you were doing it in C or any other language you already know.

Is this an assignment? It seems more complex than something you'd usually want to do in asm. However, if it only has to work for short strings, a really bad brute-force algorithm that doesn't use any data structures other than a char buffer would be easier to implement (see below).

There are less than 128 printable ASCII characters, so a table of counts is small and easy.

The difference between characters and words is that the number of possible words is infinite (barring memory-size limitations), so you can't just use words directly as an index into a table of counts. Even using 4-character words as a 4-byte index into a table of counts would make the count table size = 2^32 entries.

The most direct way to adapt the count-table algorithm from character to words is with some kind of dictionary data structure, like a hash table, tree, or even a radix tree / trie. Each entry maps a word to its counter.

The simplest possible "dictionary" of word:count would be an unsorted array of
struct {char *word; int count;} histogram[]; that you linear-search. This is not efficient, but would be easy to implement in asm. Word comparison could be sped up by storing explicit-length strings like int wordlen; char *wordstart so you can just point into your string without having to store 0 terminators. Then you can check if words are even the same length before looking at their bytes, and memcmp can be more efficient than strcmp.

This sucks because it's a linear search of this array of counts for every input word, but it's still probably better than the brute-force way below that scans (and copies) the whole rest of the string for every input word. Keeping this array sorted (insertion-sort style) might or might not be better, allowing O(log(n)) access with a binary search when the word is found. There are probably clever tradeoffs like gapped arrays (with a way to signal unused slots) that can let an insert at a specific place only need to copy a limited amount. Or of course a classic dictionary data structure like a hash table or tree.

Another option is to sort your list of words, and then loop through the sorted list counting duplicates. Or accumulate duplicate-counts while you sort, which is an interesting problem when your input word list is too big to fit in memory all at once.

your current char loop could be nicer:

This is what I might do:

;; tighter version of your char-count loop
   xor   eax, eax                    ; clear upper bits of EAX
L1:
   lodsb                          ; AL = [ESI], inc ESI
   inc   DWORD PTR [edi + eax*4]  ; inc table[AL]
   test  al,al                    ; Set flags based on AL
   jz    L1                       ; loop if it's not the end of the string
   ;; fall through when done
   ;;dec   DWORD PTR [edi]          ; undo the count of the zero if you care

Or more likely a movzx eax, byte [esi] / inc esi, which is more efficient than lodsb, avoiding the cost of merging into the old EAX.

This does count the terminating 0, but saves instructions in the loop. With more code outside the loop, you could avoid this while still keeping the loop small.

Since we left-shift as part of the addressing mode, we don't need to zero eax inside the loop. Using xor-zeroing also avoids a partial-register slowdown when reading eax after writing just al, if you care about performance.

A brute-force algorithm that might be easier to implement in asm

This will perform terribly for long strings, but might be pretty good for very short strings. The main advantage is ease of implementation, but it may be competitive in performance to a tree or hash for very short strings, like 128 bytes.

Find the end of the first word
Search the rest of the string for occurrences of that word. (Remember to match full words only, not like strstr(3))
At every match, copy the rest of the string to the left, removing the word from the string. (Remove the first word, too.)
Repeat until the string is empty

(You don't have to copy the first occurrence out, just advance your pointer so the next search starts at the end of it. If no duplicates are found, you don't need to do any copying.)

This spends a lot of time copying characters around, but we need to avoid duplicates (doing another count for the last n-1 occurrences of a word when we come to it later). Duplicate checking by searching another data structure of already-counted words would let you avoid copying, but if you're going to do that you might as well just do one pass over the string and histogram into that found-words "dictionary", see above.

but that itself costs extra cycles and code, and all future searches would be searching these already-counted words. If you going to use an array of already-found words, you might as well

You can do the copying with rep movsb, which I think still works with overlapping dst and src. (But is not fast if they overlap closely).

Another option is to have a 2nd equal-size buffer, so you only do one copy of the whole string while removing all occurrences of the current word you're counting:

Starting from the 2nd word in string A:

copy characters into string B.
when you reach the end of a word, check if it's a match for the current word you're looking for.
- if yes: count++ and dst-=length_of_current_countword. When you detect the word you're looking for, you rewind your destination pointer (edi) to the beginning, so future copying will overwrite it. I.e. you can un-copy it basically for free at this point.
repeat until you reach the end of string A
print the current word and the count.
If string B isn't empy, it becomes the new string A, etc. Maybe keep pointers to A and B in registers or memory where you can swap them, rather than using the address of the static storage directly. Or just do it the dumb way and copy B to A with no changes.

Hi Peter, thank you for responding. For your brute force algorithm, how should I complete the first few steps you outlined? Should I use delimiter, and split the string at the first occurrence of a space " "? Then once I have the word, should I store it in array or register. Then how do I compare to the rest of the string — coderintraining, May 18 '16 at 16:10
@coderintraining: I was picturing checking for non-word characters to find the end of a word. The simplest would be to find the first `" "`. You can't store a word in a register, because it's potentially infinite. Instead, you keep track of its start position with a pointer. (e.g. before the copy loop, `mov ecx, esi`. When you find an end-of-word, `ecx` points to the start, and `esi-ecx` is the length.). You don't need any extra arrays to copy words, but you might need some extra space to spill pointers / lengths if you run out of registers. — Peter Cordes, May 19 '16 at 00:59
You search the rest of the string using the same algorithm to find words. Then, if they have the same length as the word you're looking for, `memcmp` them with a loop or `rep cmpsb`. It's still a really complicated algorithm to implement in asm without any library functions. Are you allowed to call libc string functions like `strtok`? It's definitely too complicated for me to imagine the whole thing in my head at the "what register holds what" level of detail, but each step on its own is something that's fairly simple in asm. — Peter Cordes, May 19 '16 at 01:00
i cant seem to get the first word into a register. say my string is "one two three", i use the following code, and it only provides me with "two three" — coderintraining, May 21 '16 at 04:27
mov edi,OFFSET str2 Mov esi,edi Mov Ecx, 0 ;reset ecx to 0 Not Ecx ;set Ecx to -1 or highest possible integer Mov Al, ' ' ;Initialize a1 to delimiter of (space) ' ' Cld ;Clear Direction Pointer Repne Scasb ;scan edi one byte at a time until delimiter found Not Ecx Lea Eax, [ecx-1] ;Set Eax to index of found delimiter Xchg Esi, Edi ;Take Edi which is now equal to string after found delimiter and put in esi mov edx, esi call WriteString — coderintraining, May 21 '16 at 04:28
@coderintraining: you know that un-formatted code in comments is unreadable for everyone, right? Edit that into your question. Anyway, you don't load words into registers, just pointers to them. Save a pointer to the start of the buffer before searching for the end of the first word. — Peter Cordes, May 21 '16 at 05:44
Sorry i didnt realize it dididnt format. I just edited the question. My code is at the bottom. If is save the point of where the strings starts, and have the point where the first word ends. Ive tried subtracting the two then putting it in edx to display, but it doesnt work either — coderintraining, May 21 '16 at 17:00

How do you determine frequency of words in a user entered string in Assembly Language X86?

1 Answers1

your current char loop could be nicer:

A brute-force algorithm that might be easier to implement in asm