1

Very new to programming in C so sorry if I have badly misunderstood something. I am currently doing the speller problem set from CS50 if anyone is familiar with this, and I given words from a text to check if they are spelled correctly by comparing them to a given dictionary. I have sorted this dictionary into a hash table with about 17,000 buckets which point to on average, a linked list with a length of around 100 nodes. There may be several hundred thousand words that need to be spell checked.

My question is this, will checking to see if the length of each word from my dictionary matches the length of the word required to be spellchecked, using strlen(), before then using strcmp() only if the lengths match, be faster than just checking if the strings match using strcmp().

I do potentially see that if there are lots of words that have the same length as your word you want to check, checking the length will disadvantage you but I am wondering if the speed increase, if there is one, by checking the length for words with a more uncommon length will make up for this.

Mureinik
  • 297,002
  • 52
  • 306
  • 350
H_Boofer
  • 413
  • 1
  • 3
  • 13

2 Answers2

2

Will comparing string length before checking if string is the same yield me non-negligible speed increases for C?

Either you explicitly keep the string bytes (as a flexible array member) with its length in some struct, and then yes, you could win a tiny bit of performance, or you use strlen which will scan all the bytes. Be aware of CPU cache. Study for inspiration the source code of open source libraries like Glib (they implement hashtables like you do...)

For more, read Modern C and study the source code of open source implementations such as GNU libc and GCC.

A similar question is implementing matrices in C. Then look into this answer.

Actually, you should benchmark.

If you use Linux and GCC, compile with gcc -pg -O2 -Wall then use gprof(1) or time(1) or perf(1) to profile your program. See of course time(7) and syscalls(2).

With other compilers or operating systems, read their documentation.

It may happen that in your code, the performance gains are negligible in practice (a few percents). Most English words have less than 16 bytes, which would fit into an L1 cache line (on current laptop processors in 2020).

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
Basile Starynkevitch
  • 223,805
  • 18
  • 296
  • 547
2

strcmp is an O(n) operation - it iterates over both strings until one of them ends or a mismatching pair of characters is encountered, so at first glance comparing the lengths sounds like a good idea. However, strlen in C is also an O(n) operation - it takes a char* and iterates until it hits a \0 character. So just using strlen naively would in fact probably make your program slower.

Mureinik
  • 297,002
  • 52
  • 306
  • 350