C++, Algorithm for searching for a word on a line in a large file?

Question

I'm trying to find the best method to look up a which lines in a large file contain a certain word.

For example if you had the following file:

cat dog monkey 
banana chair elephant 
monkey phone platypus cat

I'd like it to be able to return 0, 2 for "cat"

I'd expect the function prototype to look something like this:

std::vector<int> FindWords(std::string word);

I'd like to pre-process the file into some data structure, so that lockups can be done quickly giving the line numbers that word is contained on. I'm aware std::map could do this if there was only one instance of the word but there are more.

What is the most appropriate algorithm for this?

You may need to define what a word is and how to handle compounds: cat ate a pill, got ill caterpillar cater pillar. — greybeard, Mar 13 '15 at 19:22

sray · Answer 1 · 2015-03-13T14:43:14.580

Build a trie data structure for all the unique words in the file.

For each word in the trie, store the list of line numbers where the word is present in the file. This can be done in a single pass through the file.

You could use a map also to store the list of line numbers for each word, but a trie would be more compact.

C declarations for trie data structure added below. This should give you an idea of how to get started if you wanted to implement yourself.

/*
 * TRIE data structure defined for lower-case letters(a-z)
 */
typedef struct trie {
  char c;                           /* Letter represented by the trie node */
  struct trie *child[26];           /* Child pointers, one for each of the 26 letters of the alphabet */
  bool isTerminal;                  /* If any word ends at that node, TRUE, else FALSE */
  int counts;                       /* Number of lines the word ending at node occurs in the text */
  int lines[MAX_NUM];               /* Line numbers of the word occurences in the text */
} trie;

/*
 * Insert a word into the trie.
 * word - Word which is being inserted
 * line - Line number of word in the text.
 */
void insertToTrie(trie *node, const char *word, int line);

@SvaLopLop You should used some existing trie implementation, rather than rolling your own. It isn't that hard to get the basics but there are some non-trivial optimization for performance. no need to reinvent the wheel.See here. there are several small lirbaries that provide tries in C++. http://stackoverflow.com/a/28758337/1098041 — Félix Cantournet, Mar 13 '15 at 14:33
Added trie data structure and function for C implementation to give you an idea. — sray, Mar 13 '15 at 14:43

W.F. · Answer 2 · 2015-03-13T14:36:25.190

0

You could also use std::multimap or if even better std::unordered_multimap as you don't need to iterate through entire map collection only on the elements of a certain value.

Edit: Simple example:

#include <iostream>
#include <unordered_map>

int main() {
   std::unordered_multimap<std::string, int> mymap;
   mymap.insert(std::pair<std::string, int>("word", 1));
   mymap.insert(std::pair<std::string, int>("anotherword", 2));
   mymap.insert(std::pair<std::string, int>("word", 10));
   for (auto it = mymap.find("word"); it != mymap.end() && it->first == "word"; it++) {
      std::cout << it->second << std::endl;
   }

}

edited Mar 13 '15 at 14:36

answered Mar 13 '15 at 14:27

W.F.

13,888
2
34
81

This method has to be faster? and simpler than the trie one. It also has the advantage of using stl parts. – SvaLopLop Mar 13 '15 at 16:14
1

I don't think the method will be much quicker, because trie structure is a kind of index that in a linear time (to the word length) can answer in which document the word was used -- in your example in which line. But definitly it is the simpler approach. On the other hand trie will have more complicated structure and as so higher memory complexity. But will be able to answer more sophisticated queries for example which line contain the word starting with some prefix... It really depends on your needs which approach should you take. – W.F. Mar 13 '15 at 16:28
PS. Of course this method will also have the linear complexity (to the word length) because it does string hashing :) – W.F. Mar 13 '15 at 16:30
Trie shares nodes for common prefixes of the words. In that way, a trie has potential to be more compact for large number of words – sray Mar 13 '15 at 18:22

score 0 · Answer 3 · answered Mar 13 '15 at 16:20

0

Boyer–Moore string search algorithm is faster then a trie when you are searching for a single string. Most likely you can modify it for multiple string.

answered Mar 13 '15 at 16:20

Micromega

12,486
7
35
72

Boyer moore is useful for searching a single string. But when multiple words need to be looked up, a trie would be more useful. The trie is built once and for all by doing a single pass through the file. Word searches need not refer to the original large file anymore. – sray Mar 13 '15 at 16:25
@sray:Imo you are wrong. It about the search not the file. – Micromega Mar 13 '15 at 16:29
@Phpdna: I/O is always slower than searching memory. Thus it could be about file I/O. – Thomas Matthews Mar 13 '15 at 17:47
@ThomasMatthews and sray:Why would you build a trie from the file? – Micromega Mar 13 '15 at 18:29
So that you can do lookups for multiple words. – sray Mar 13 '15 at 18:37
@sray:There isn't a need for a trie. – Micromega Mar 13 '15 at 18:52

C++, Algorithm for searching for a word on a line in a large file?

3 Answers3