
I have a dictionary .txt file with probably over a thousand words and their definitions. I've already written a program to take the first word of each line from this file and check it against a string input by the user:

#include <algorithm> // std::transform
#include <cctype>    // ::tolower
#include <fstream>   // std::ifstream
#include <sstream>   // std::istringstream
#include <string>

std::string checkWord(string input)
{
    std::ifstream inFile;
    inFile.open("Oxford.txt");
    if (inFile.is_open())
    {
        string line; //there is a "using std::string" in another file
        while (getline(inFile, line))
        {
            //read the first word from each line
            std::istringstream iss(line);
            string word;
            iss >> word;
            //make sure the strings being compared are the same case
            std::transform(word.begin(), word.end(), word.begin(), ::tolower);
            std::transform(input.begin(), input.end(), input.begin(), ::tolower);
            if (word == input)
            {
                //Do a thing with word
            }
        }
        inFile.close();
        return "End of file";
    }
    else
    {
        return "Unable to open file";
    }
}

But if I'm checking more than a sentence, the time it takes to process becomes noticeable. I've thought about a few ways of making this time shorter:

  • Making a .txt file for each letter of the alphabet (Pretty easy to do, but not really a fix in the long-term)
  • Using unordered_set to compare the strings (like in this question); the only problem with this might be the initial creation of the set from the text file
  • Using some other data structure to compare strings? (Like std::map)

Given that the data is already "sorted", what kind of data structure or method should I employ in order to (if possible) reduce time complexity? Also, are there any issues with the function I am using to compare strings? (for example, would string::compare() be quicker than "=="?)

Lucas Saldyt
  • `std::ifstream` is known to be pretty slow, you may consider an alternative. – Slava Jul 23 '15 at 20:40
  • Does searching through 1000 words really take that long? Some REALLY simple improvements would be to lower-case the file before reading it, and to lower-case the input word only once at the beginning of the function. And if you search multiple times, load the contents of the file into memory and search the loaded list rather than reading the file many times. – Mats Petersson Jul 23 '15 at 20:40
  • @MatsPetersson: Looks like an _answer_ to me!! – Lightness Races in Orbit Jul 23 '15 at 20:42
  • I don't see the need to transform `input` to lower case on every iteration. It may not bring any improvement, since a good compiler is likely to optimize this part anyway, but it makes for better-looking code. Also, I believe that `iss >> word` may be quite an expensive operation compared with a string tokenization approach (see http://www.cplusplus.com/reference/cstring/strtok/ or http://stackoverflow.com/questions/53849/how-do-i-tokenize-a-string-in-c). People more familiar with the innards of C++ may want to comment on this. – Alain Jul 23 '15 at 20:45
  • Thanks to all of you! – Lucas Saldyt Jul 23 '15 at 20:46

4 Answers


A tree (std::map)? Or a hash map (std::unordered_map)? Your linear search is obviously a brute-force solution! Both of the above will be substantially superior for multiple searches.

Of course, that only really helps if you are going to use this data multiple times per program run, which you didn't specify in your question. If not, there's not really much benefit in loading and parsing and storing all the data only to perform a single lookup then quit. Just put a break in on success, at least.

You imply that your input file is sorted. You could hack together a binary search solution with file seeking (which is really cheap) and snapping to the nearest newline on each iteration to determine roughly where all the words with the same leading (say) three characters are in your file. For a thousand entries, though, this is probably overkill.
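A minimal sketch of the hash-map approach, assuming the same Oxford.txt layout as in the question (the first token of each line is the word, the rest of the line is its definition) - the function names here are just illustrative:

#include <algorithm>
#include <cctype>
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>

// Read the file once and keep it in memory, keyed by lower-cased word.
std::unordered_map<std::string, std::string> loadDictionary(const std::string& path)
{
    std::unordered_map<std::string, std::string> dict;
    std::ifstream inFile(path);
    std::string line;
    while (std::getline(inFile, line))
    {
        std::istringstream iss(line);
        std::string word;
        if (iss >> word)
        {
            std::transform(word.begin(), word.end(), word.begin(), ::tolower);
            std::string definition;
            std::getline(iss, definition); // the rest of the line
            dict[word] = definition;
        }
    }
    return dict;
}

// Average O(1) lookup instead of scanning the whole file.
bool containsWord(const std::unordered_map<std::string, std::string>& dict, std::string input)
{
    std::transform(input.begin(), input.end(), input.begin(), ::tolower);
    return dict.find(input) != dict.end();
}

Build the map once at startup and reuse it for every lookup; the file is then read only one time per program run.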

Lightness Races in Orbit
  • Also hash values calculated from strings might help. – πάντα ῥεῖ Jul 23 '15 at 20:34
  • Honestly, I'm pretty new to data structures in general, so, I'm not sure which of those to look into? Also, I'll probably be calling this function fairly often throughout the course of the program, so the less time it takes, the better. By saying that my file is sorted, I mean that the words are listed from A to Z, so I'll look into a binary search too? (Thanks) – Lucas Saldyt Jul 23 '15 at 20:38
  • @Lucas: Look into both of them? – Lightness Races in Orbit Jul 23 '15 at 20:41
  • @LucasSaldyt if you look up only once per program run, a data structure would not help; only a more efficient way to find the word in the file would. – Slava Jul 23 '15 at 20:41
  • @LucasSaldyt for multiple lookups, read into `std::unordered_map` (unless you need them to be ordered for whatever reason, then use std::map) and then look there. – Slava Jul 23 '15 at 20:44

So, there are "simple" fixes, and there are some more complex ones.

The first step is to move all unnecessary things out of the search loop: lowercase input once, before the loop, rather than every time - after all, it's not changing. If possible, lowercase Oxford.txt too, so you don't have to lowercase word for every line.

If you are searching the file multiple times, reading a file multiple times is definitely not a great solution - even if it's cached in the filesystem the second time.

So read it once into some container - a really simple one would be std::vector [and lower-case the strings at the same time] - and just iterate over it. The next improvement would be to sort the vector and use a binary search (std::binary_search or std::lower_bound from <algorithm> will do this for you, or you can write one yourself - it's not terribly hard).

A slightly more complex solution [but faster to search] would be to use std::map<std::string, std::string> wordlist (but that also takes a bit more space), then use auto pos = wordlist.find(input); if (pos != wordlist.end()) ... found word ....
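A minimal sketch of the sorted-vector variant, under the same assumptions about the Oxford.txt layout as in the question:

#include <algorithm>
#include <cctype>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Load the first word of each line, lower-cased, then sort once up front.
std::vector<std::string> loadWords(const std::string& path)
{
    std::vector<std::string> words;
    std::ifstream inFile(path);
    std::string line;
    while (std::getline(inFile, line))
    {
        std::istringstream iss(line);
        std::string word;
        if (iss >> word)
        {
            std::transform(word.begin(), word.end(), word.begin(), ::tolower);
            words.push_back(word);
        }
    }
    std::sort(words.begin(), words.end());
    return words;
}

// O(log n) lookup using the standard library's binary search.
bool contains(const std::vector<std::string>& words, std::string input)
{
    std::transform(input.begin(), input.end(), input.begin(), ::tolower);
    return std::binary_search(words.begin(), words.end(), input);
}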

Mats Petersson

You can benefit from using a prefix tree, also known as a trie data structure, as it fits the use case of having a dictionary and frequently looking up words in it.

The simplest model of a trie is a tree where each node holds a letter and a flag to tell whether the current letter is the end of a word (and, additionally, pointers to other data about the word).

Example picture of a trie containing the dictionary words aback, abate, bid, bird, birth, black, blast:

[image: trie diagram for the words above]

To search for a word, start from the root, and for each letter of your word, follow the node containing the current letter (halt if it isn't present as a child of the current node). The search time is proportional to the length of the word being looked up, not to the size of your dictionary.

A trie also allows you to easily get the alphabetic (lexicographical) order of words in a dictionary: just do a pre-order traversal of it.
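A minimal trie sketch, assuming lower-case ASCII words only (the node layout and names are illustrative, not part of the answer above):

#include <memory>
#include <string>

// One node per letter; children indexed by 'a'..'z'.
struct TrieNode
{
    bool isWord = false;                  // does a word end at this node?
    std::unique_ptr<TrieNode> children[26];
};

void insert(TrieNode& root, const std::string& word)
{
    TrieNode* node = &root;
    for (char c : word)
    {
        auto& child = node->children[c - 'a'];
        if (!child)
            child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->isWord = true; // mark the end of a complete word
}

// Search cost grows with the length of the word, not the dictionary size.
bool contains(const TrieNode& root, const std::string& word)
{
    const TrieNode* node = &root;
    for (char c : word)
    {
        node = node->children[c - 'a'].get();
        if (!node)
            return false; // letter not present as a child: halt
    }
    return node->isWord;
}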

Alexandre

Instead of storing everything in a .txt file, store it in a real database.

SQLite3 is a good choice for simple tasks, since it is in-process instead of requiring an external server.

For a task this simple, the C API and SQL statements should be very easy to learn.

Something like:

-- Only do this once, for setup, not each time you run your program.
sqlite> CREATE TABLE dictionary (word TEXT PRIMARY KEY);
sqlite> .import /usr/share/dict/words dictionary
-- Do this every time you run your program.
sqlite> SELECT count(*) FROM dictionary WHERE word = 'a';
1
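For completeness, a rough sketch of the same lookup through the SQLite C API (error handling omitted; the database file name is just an example, and a real program would keep the connection and prepared statement open across calls):

#include <sqlite3.h>
#include <string>

// Returns true if `word` exists in the dictionary table.
bool checkWord(const std::string& word)
{
    sqlite3* db = nullptr;
    if (sqlite3_open("dictionary.db", &db) != SQLITE_OK)
        return false;

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "SELECT 1 FROM dictionary WHERE word = ?1",
                       -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, word.c_str(), -1, SQLITE_TRANSIENT);

    bool found = (sqlite3_step(stmt) == SQLITE_ROW); // a row back means the word exists

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return found;
}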
o11c
  • I hadn't really thought of this, but I'll try it out, too. (I've never worked with a database before, so I appreciate the concise approach to this) – Lucas Saldyt Jul 23 '15 at 20:54