0

I need to count the number of times some expressions (strings of words, can be any length and any number of words) appear in large text inputs. There can be hundreds of thousands of these expressions (and more), that are all saved in a db in advance.

What would be a good efficient approach to achieve this? I'd prefer it to be implemented as "asynchronously" as possible.

My current idea is to create a large regular expression with as much expressions as I can as patterns (separated with an | operator) - and simply run it and count the matches. Are there any other options? If so what are their advantages? Perhaps there is a certain query in the SQL level that I can run, that I'm missing?

Yuval A.
  • 5,849
  • 11
  • 51
  • 63
  • 1
    Not sure what your language is, but the answer might probably be a regex trie. See [this answer](http://stackoverflow.com/a/42789508/3832970) for a hint. – Wiktor Stribiżew May 11 '17 at 23:33

1 Answers1

0

I think you should use a hash table for this. A regular expression for this situation would be cumbersome and computationally expensive compared to a simple hash table. A hash table is an associative array that holds key-value pairs. You can match a key (a string, in your example) to a number (the number of times it's appeared in the input next). Simply place all of the strings from your database into the hash table as keys with values of zero. Check every string in the input text to see if it's in the hash table. If it is, increment the value, if not, do nothing. C++, Java, C#, Python and most other general purpose languages have an implementation of the hash table. I wrote a simple program in C++ which demonstrates this capability:

#include<iostream>
#include<unordered_map>
#include<string>
#include<fstream>
int main()
{
        std::unordered_map<std::string, int> map;
        std::ifstream matches("matches.txt");
        std::ifstream input("input.txt");
        std::string in;
        while(matches>>in){
                map[in] = 0;
        }
        while(input>>in){
                if(map.find(in) != map.end())
                        ++map[in];
        }
        for(auto i : map)
                std::cout<<i.first<<" "<<i.second<<std::endl;
        return 0;
}

This C++ code creates a hash table (called an unordered_map). It then reads in "matches", which represents the patters from your DB, and adds them to the table with initial keys of zero. It reads in inputs from the "inputs" stream, and checks to see if they are in the hash table. If they are, it increments the key value. Then the program prints the elements of the table with the number each appears in key-value order.

Jayson Boubin
  • 1,504
  • 2
  • 11
  • 19