One of the most costly tasks you can do in programming is file I/O. You want to minimize the number of file opens and reads (although you do get a default file buffer of BUFSIZ chars that helps: 8192 bytes on Linux, 512 on Windows).
The way you want to approach the task is to read your input once, process it as required, and then write the processed information once to the required files.
Here, according to your answers to my comments and your edit, you want to determine the number of times each word is seen (max of 16 chars per word), write the sentence entered by the user to "TF4_1.txt", and write the word frequency to "TF4_2.txt". (The sort order is not specified; add a call to qsort if a specific order is required.)
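If a particular order is required, a comparator along these lines could be handed to qsort before the statistics are written. This is only a minimal sketch: the names cmp_wstat and sort_wstats are made up for illustration, the ordering (highest count first, ties broken alphabetically) is an assumption since no order was specified, and the wstat struct and MAXL constant are repeated from the declarations that follow so the block stands on its own.

```cpp
#include <cstdlib>   /* qsort */
#include <cstring>   /* strcmp */

#define MAXL 32      /* same word-length constant used in the answer */

typedef struct {     /* same struct used in the answer */
    char word[MAXL];
    int count;
} wstat;

/* hypothetical qsort comparator: higher count first, ties alphabetical */
int cmp_wstat (const void *a, const void *b)
{
    const wstat *wa = (const wstat *)a,
                *wb = (const wstat *)b;

    if (wa->count != wb->count)     /* sort by count, descending */
        return (wa->count < wb->count) - (wa->count > wb->count);

    return strcmp (wa->word, wb->word);     /* then by word, ascending */
}

/* hypothetical helper: sort the first n elements of the stats array */
void sort_wstats (wstat *stats, size_t n)
{
    qsort (stats, n, sizeof *stats, cmp_wstat);
}
```

With the program in this answer, the call would be sort_wstats (wstats, wcount); placed just before the loop that writes "TF4_2.txt".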
When you think about coordinating multiple pieces of information of differing types, you should immediately think struct. For two pieces of data, you can get away with multiple arrays, but generally, an array of struct that holds the information is preferred. Here you have a word and a count you want to keep for each individual word. You could declare a simple struct to handle your storage needs as follows:
#define MAXC 1024 /* if you need constants, define them */
#define MAXL 32 /* (don't skimp on buffer size) */
#define MAXW 256 /* max chars in buf, word len, no. words */
...
typedef struct { /* struct to associate word and count */
    char word[MAXL];
    int count;
} wstat;
(a typedef is used for convenience)
The remainder of the logic is fairly standard for this type of problem. You read your sentence, tokenize the string, convert each token to lowercase (in your case), and then loop over the words you have already stored, comparing the lowercase token to each stored word. If you find a match, you simply increment the count for that word. Otherwise you copy the lowercase token to element.word of the next available element in your array of struct, increment that element.count, and increment the array of struct index.
You must also take care to protect your array bounds by comparing the index to the maximum number of elements each time a new word is added.
When you are done processing each token, you simply write your array to "TF4_2.txt", close the file, and you are done.
Putting it all together, you could do something similar to the following:
#include <iostream>
#include <iomanip>
#include <fstream>
#include <cstring>
#include <cctype>

using namespace std;

#define MAXC 1024   /* if you need constants, define them */
#define MAXL   32   /* (don't skimp on buffer size) */
#define MAXW  256   /* max chars in buf, word len, no. words */

#define SENTOUT "TF4_1.txt"     /* sentence out filename */
#define STATOUT "TF4_2.txt"     /* statistics out filename */

typedef struct {    /* struct to associate word and count */
    char word[MAXL];
    int count;
} wstat;

int main (void) {

    char buf[MAXC] = "",                /* buffer to hold line */
        *p = buf;                       /* pointer to buffer */
    const char *delim = " ,./?!:();";   /* strtok delimiters */
    int wcount = 0;                     /* word count */
    wstat wstats[MAXW] = {{"", 0}};     /* word stats array */

    /* prompt for input */
    cout << "enter sentence (words 16 char or less): ";
    if (!(cin.get (buf, MAXC, '\n'))) {     /* validate input */
        cerr << "error: invalid input or user canceled.\n";
        return 1;
    }
    cout << buf << "\n";                /* output to stdout (optional) */

    ofstream f(SENTOUT, ios::trunc);    /* open TF4_1.txt for writing */
    if (!f.is_open()) {                 /* validate file open for writing */
        cerr << "error: file open failed '" << SENTOUT << "'.\n";
        return 1;
    }
    f << buf << "\n";                   /* write sentence to TF4_1.txt */
    f.close();                          /* close TF4_1.txt */

    /* tokenize input */
    for (p = strtok (p, delim); p; p = strtok (NULL, delim)) {
        int seen = 0;                   /* flag if word already seen */
        char lccopy[MAXL] = "",         /* array for lower-case copy */
            *rp = p,                    /* read-pointer to token */
            *wp = lccopy;               /* write-pointer for copy */

        /* iterate over each char, stopping before lccopy would overflow
         * (tokens longer than MAXL - 1 chars are truncated)
         */
        while (*rp && wp < lccopy + MAXL - 1)
            *wp++ = tolower ((unsigned char)*rp++);     /* convert to lowercase */
        *wp = 0;                        /* nul-terminate lccopy */

        for (int i = 0; i < wcount; i++)    /* loop over stored words */
            /* compare lccopy to stored words */
            if (strcmp (lccopy, wstats[i].word) == 0) { /* already stored */
                wstats[i].count++;      /* increment count for word */
                seen = 1;               /* set seen flag */
            }

        if (!seen) {    /* if not already seen */
            strcpy (wstats[wcount].word, lccopy);   /* copy to wstats */
            wstats[wcount++].count++;   /* increment count for word */
            if (wcount == MAXW) {       /* protect array bounds */
                cerr << "maximum words reached: " << MAXW << "\n";
                break;
            }
        }
    }

    f.open (STATOUT, ios::trunc);       /* open TF4_2.txt */
    if (!f.is_open()) {                 /* validate file open for writing */
        cerr << "error: file open failed '" << STATOUT << "'.\n";
        return 1;
    }

    for (int i = 0; i < wcount; i++) {  /* loop over stored word stats */
        /* output to stdout (optional) */
        cout << " " << left << setw(16) << wstats[i].word <<
                " " << wstats[i].count << "\n";
        /* output to TF4_2.txt */
        f << " " << left << setw(16) << wstats[i].word <<
                " " << wstats[i].count << "\n";
    }

    f.close();      /* close TF4_2.txt */
}
Example Use/Output
$ ./bin/wordlenfreq
enter sentence (words 16 char or less): This is my test sentence. Is it fun?
This is my test sentence. Is it fun?
 this             1
 is               2
 my               1
 test             1
 sentence         1
 it               1
 fun              1
Example TF4_1.txt
$ cat TF4_1.txt
This is my test sentence. Is it fun?
Example TF4_2.txt
$ cat TF4_2.txt
 this             1
 is               2
 my               1
 test             1
 sentence         1
 it               1
 fun              1
While it is always good to master using the basic types, such as char, and learn to account for the characters you fill and the element indexes you store, you might as well write the code in C. That would be a simple matter of changing the header file names and swapping fgets or POSIX getline for cin.get, printf (or fprintf) for cout and cerr, and fopen/fclose for your file stream open/close operations.
With C++, the string and vector types can make your job much easier. They would handle the string and struct storage requirements and, because they grow as needed, help ensure you do not write beyond the bounds of your storage. (But note: C++ getline can split on only a single delimiter character, so you would still need something that handles multiple delimiters, such as strtok from <cstring>.)
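To give a feel for the string and vector approach, here is a minimal sketch of the counting step only. It is not the code above: the WordStat struct and count_words function are hypothetical names, and instead of strtok it swaps in std::string::find_first_of and find_first_not_of to split on the same set of delimiters, which is one possible alternative rather than the only way to do it.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct WordStat {           /* hypothetical C++ counterpart of wstat */
    std::string word;
    int count;
};

/* hypothetical helper: count lowercase word frequencies in line, splitting
 * on any char in delims; words are kept in order of first appearance
 */
std::vector<WordStat> count_words (const std::string& line,
                                   const std::string& delims = " ,./?!:();")
{
    std::vector<WordStat> stats;
    std::size_t start = line.find_first_not_of (delims);   /* skip leading delims */

    while (start != std::string::npos) {
        std::size_t end = line.find_first_of (delims, start);   /* end of token */
        std::string word = (end == std::string::npos)
                            ? line.substr (start)               /* last token */
                            : line.substr (start, end - start);

        for (char& c : word)    /* convert to lowercase */
            c = static_cast<char>(std::tolower (static_cast<unsigned char>(c)));

        bool seen = false;
        for (WordStat& ws : stats)      /* loop over stored words */
            if (ws.word == word) {      /* already stored */
                ws.count++;
                seen = true;
                break;
            }
        if (!seen)                      /* new word, vector grows as needed */
            stats.push_back ({word, 1});

        start = line.find_first_not_of (delims, end);   /* next token, if any */
    }

    return stats;
}
```

There is no MAXW or MAXL limit to check here, since the vector and each string allocate whatever storage they need. Reading the sentence with getline (cin, line) and writing the two files would stay essentially as shown earlier.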
Look things over and let me know if you have further questions.