The "%s" conversion specifier matches a sequence of non-whitespace characters, skipping any leading whitespace.
#define QUOTE(s) #s
#define STR(s) QUOTE(s)
#ifndef BUFSIZE
# define BUFSIZE 255
#endif
char buf[BUFSIZE+1];
while (fscanf(fin, "%" STR(BUFSIZE) "s", buf) == 1) { /* test for 1 conversion; fscanf returns EOF (nonzero) at end of file */
/* buf holds next word. Todo:
+ allocate space for word
+ copy word to newly allocated space
+ add to linked list
*/
}
Alternatively, strtok can be used to tokenize (break up) a string into substrings, using a set of delimiter characters (as a character array) you specify. Your system may also have strsep, which is intended to replace strtok. Both strtok and strsep modify the array you pass in, so take care that this won't cause issues with other parts of the code that access the same data. strtok is not thread-safe, since it keeps its position in static storage; if you have multiple threads parsing strings, use strsep or strtok_r instead.
#ifndef BUFSIZE
# define BUFSIZE 256
#endif
const char separators[] = "\t\n\v\r\f !\"#$%&'()*+,-./:;<=>?@[\\]^`{|}~";
char buf[BUFSIZE], *line, *word, *rest;
while (fgets(buf, sizeof buf, fin)) { /* fgets writes at most sizeof buf - 1 chars plus the terminator */
rest = line = buf;
while ((word = strtok_r(line, separators, &rest))) {
/* Todo:
+ allocate space for word
+ copy word to newly allocated space
+ add to linked list
*/
line=rest;
}
}
Since the second example reads a line at a time from the file for strtok_r to work on, any line longer than the buffer gets split across reads, and if the split lands in the middle of a word, that word is broken in two. A solution would be to create a buffered string stream: when the end of the buffer is reached, shift anything remaining in the buffer to the front and fill the rest of the buffer with more data from the file (just be careful about words longer than the buffer; in production code, that's a potential security vulnerability that could lead to denial-of-service attacks).
An issue with all of the above functions is that they don't handle null characters in the input: "%s" and the strtok family all treat '\0' as the end of the data. If you wish to parse data that may contain null characters, you'll need to use a non-standard approach, which may mean writing your own.
As for efficiency, any algorithm you use is going to need to read from the file (which is O(n) in complexity, and the I/O will dominate the running time) and allocate memory to store the words. Whether you use fscanf, strtok, or some other method, the time and space complexity isn't likely to vary much; about the only thing that might vary is how many intermediate buffers get allocated. Your best bet for finding the most efficient implementation is to try a couple and profile them.