Alright, let's look at one way you can brute-force your way to a collection of unique words without the use of a binary-tree or hash-table, just using a good old for
loop to check all existing words.
As mentioned in the comments above, you do not need a special case for the 1st word. Simply using a for
loop and looping for (i=0; i<n; i++)
will skip the loop entirely when n
is zero.
In your code above, you have another logic error pertaining to when n
is incremented. Since you are concerned with duplicate words, you only want to increment n
(the number of words in your array) when you add a word, not on every iteration.
There is another optional consideration not specified in your question. Since you are only concerned with a case-insensitive comparison, do you need to print out the word in its original case before conversion to lower-case? A second array holding the original is a convenient way to preserve it. You can either use a struct
to hold the str
and lowercase_str
(and use an array of those), or you can simply use two 2D arrays where the common index coordinates words from each. (an array of struct
is the more elegant solution)
(Note: a 100 x 100
character array is just 10k of storage - trivial with a 1M stack minimum default on Windows, more trivial with the default 4M stack on Linux. You could have roughly 100 of them on windows...)
As also mentioned in the comments, stricmp()
(e.g. strcasecmp()
) isn't the speediest of functions, and is not Standard C (it's POSIX). You can simply use a loop to convert to lower-case using tolower()
and ensure portability. (though virtually all compilers also support the POSIX functions -- with a proper define)
That said, you can change your code similar to the following to obtain the unique words in your file. Note: you should take the filename as the first argument to your program (or read from stdin
by default). Another alternative is to prompt the user and read the filename as input. You should not need to recompile your code just to read from a different input file.
Putting all the pieces together, you can do something similar to the following (with the optional 2nd words array preserving the original case for output), e.g.
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#define N 100
/* provide filename as 1st argument or read stdin by default */
int main (int argc, char **argv)
{
int i = 0,
n = 0,
maxlen = 0; /* keep longest word length for formatting output */
char str[N], /* buffer to hold string from file */
lcstr[N], /* buffer to hold lower-case string */
words[N][N], /* 2D array to hold 100 words of 99 char */
lcwords[N][N]; /* 2D array to hold 100 lowercase words of 99 char */
/* use filename provided as 1st argument (stdin by default) */
FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;
if (!fp) { /* validate file open for reading */
perror ("file open failed");
return 1;
}
/* loop continually reading word up to 99-chars from file.
* You cannot use "%s" to fill an array without protecting the array
* bounds by using the field-width modifier. Protect your words arrays
* by checking n < N.
*/
while (n < N && fscanf (fp, "%99s", str) == 1) {
int found = 0; /* found flag */
/* convert to lowercase in lcstr */
for (i = 0; str[i]; i++) {
lcstr[i] = tolower((unsigned char)str[i]);
}
lcstr[i] = 0; /* don't forget to nul-terminate lcstr */
if (i > maxlen) { /* check and save maxlen */
maxlen = i;
}
/* loop checking if word exist in words */
for (i = 0; i < n; i++) {
if (strcmp(lcwords[i], lcstr) == 0) {
found = 1;
break;
}
}
/* add word to words and lcwords array if not found.
* only increment n when adding a word to array.
*/
if (!found) {
strcpy (words[n], str);
strcpy (lcwords[n++], lcstr);
}
}
if (fp != stdin) /* close file if not stdin */
fclose (fp);
puts ("\nunique words:");
for (i = 0; i < n; i++) {
printf (" %-*s lowercase: %s\n", maxlen, words[i], lcwords[i]);
}
}
Two big notes:
(1) you have nothing that needs #include <stdlib.h>
. Only include headers that you need. When you use a function -- look it up in the man-page (e.g. at man7.org - Online Man Pages. The man page for each function (while cryptic at first) provides concise information on how to use each function -- including the header(s) required;
(2) you cannot use "%s"
alone when reading characters into an array. You must use the field-width modifier to protect from writing beyond the array bounds. In your case with a max of 100-characters, you must limit the read by fscanf()
to 99
characters with "%99s"
(to ensure 1-char of storage remains for the nul-terminating character '\0'
(equivalent to plain-old 0
).
Example Use/Output
With your data in the file dat/uniquewords.txt
, you could do:
$ ./bin/unique_words dat/uniquewords.txt
unique words:
Mi lowercase: mi
illumino lowercase: illumino
di lowercase: di
immenso lowercase: immenso
Which outputs the unique words in their original case preserved in the words
array. The array of lowercase words lcwords
is used for comparison for duplicates. Also note how the found
flag is used to only add words not found (you could use i == n
as well -- up to you)
Let me know if you have questions.
Using Array of struct To Add Occurrence Count
Per your comment indicating you need to capture the occurrence count
, you can do so by simply adding an else
to the if (!found)
block. Essentially, your logic is:
/* add word to words and lcwords array if not found.
* only increment n when adding a word to array.
*/
if (!found) {
words[n] = tmp; /* assign temp struct to words */
words[n++].count = 1; /* set count at 1 */
}
else { /* otherwise */
words[i].count += 1; /* increment occurrence count at i */
}
Rewriting to use an array of struct (optionally holding the original form of the 1st occurrence encountered) and lowercase form, you could rearrange as follows, adding the occurrence count
as well:
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#define N 100
typedef struct { /* struct holding word, lowercase word and count */
char word[N],
lcword[N];
int count;
} word_t;
/* provide filename as 1st argument or read stdin by default */
int main (int argc, char **argv)
{
int i = 0,
n = 0,
maxlen = 0; /* keep longest word length for formatting output */
char str[N]; /* buffer to hold string read from file */
word_t words[N]; /* array of struct to hold 100 words of 99 char */
/* use filename provided as 1st argument (stdin by default) */
FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;
if (!fp) { /* validate file open for reading */
perror ("file open failed");
return 1;
}
/* loop continually reading word up to 99-chars from file.
* You cannot use "%s" to fill an array without protecting the array
* bounds by using the field-width modifier. Protect your words arrays
* by checking n < N.
*/
while (n < N && fscanf (fp, "%99s", str) == 1) {
int found = 0; /* found flag */
word_t tmp = { .word = "" }; /* temporary struct to fill */
/* fill temp struct, convert to lowercase in lcword */
for (i = 0; str[i]; i++) {
tmp.word[i] = str[i];
tmp.lcword[i] = tolower((unsigned char)str[i]);
}
tmp.word[i] = 0; /* don't forget to nul-terminate lcstr */
tmp.lcword[i] = 0;
if (i > maxlen) { /* check and save maxlen */
maxlen = i;
}
/* loop checking if word exist in words */
for (i = 0; i < n; i++) {
if (strcmp (words[i].lcword, tmp.lcword) == 0) {
found = 1;
break;
}
}
/* add word to words and lcwords array if not found.
* only increment n when adding a word to array.
*/
if (!found) {
words[n] = tmp; /* assign temp struct to words */
words[n++].count = 1; /* set count at 1 */
}
else { /* otherwise */
words[i].count += 1; /* increment occurrence count at i */
}
}
if (fp != stdin) /* close file if not stdin */
fclose (fp);
puts ("\nunique words:"); /* output unique words */
for (i = 0; i < n; i++) {
printf (" %-*s count: %2d lowercase: %s\n",
maxlen, words[i].word, words[i].count, words[i].lcword);
}
}
Example Use/Output
Using the same datafile, the output would now be:
$ ./bin/unique_words_struct dat/uniquewords.txt
unique words:
Mi count: 1 lowercase: mi
illumino count: 2 lowercase: illumino
di count: 3 lowercase: di
immenso count: 4 lowercase: immenso
Let me know if that will do it for you.