0

I have a very large file (100 MB) with strings in it, and I am searching a performant way to query if a given string is available in the file. The whole line should be compared against the input string.

My idea is that a program loads the file, and after that, it can be queried if the string exists or not.

void loadfile(string filename);
bool stringAvailable(string str);

The loadfile() function does not need to be fast, since it is called occasionally. But stringAvailable() needs to be as performant as possible.

At the moment I have tried:

1. Let the linux command line tools do the job for me:

system("cat lookup | grep \"^example$\"");

=> Not very fast.

2. Having a MySQL database instead of a file (I tried MyISAM and InnoDB) and query it like SELECT count(*) FROM lookup WHERE str = 'xyz'

=> Very fast, but it could be still faster. Also, it would be better to have a program which is not dependent on a DBMS.

3. Having an array of strings (string[] ary) and compare all values in a for() loop.

=> Not very fast. I guess it can be optimized with hashtables, which I am currently experimenting.

Are there other possibilities to make the process even more performant?

Daniel Marschall
  • 3,739
  • 2
  • 28
  • 67
  • Why would you need to load the whole file first, instead of reading a line, compare it, continue? – πάντα ῥεῖ Aug 28 '16 at 11:29
  • Because I will query the data more than once, and because disk access is slower than memory access, I work in the memory. (And note: It is mainly a lookup file which is changed rarely, so I only re-load it a few times in a month) – Daniel Marschall Aug 28 '16 at 11:33
  • OK, that requires to read block wise in, and search in memory right. You may check if your OS offers a memory mapped file feature, that might be fastest. – πάντα ῥεῖ Aug 28 '16 at 11:35
  • Possible duplicate of [What is the Fastest Method for High Performance Sequential File I/O in C++?](http://stackoverflow.com/questions/1201261/what-is-the-fastest-method-for-high-performance-sequential-file-i-o-in-c) – πάντα ῥεῖ Aug 28 '16 at 11:42
  • Your second (and third) solution makes me think that there is more about the string to be searched than just "is present in the file", you probably haven't fed the database with the 5000T strings possible (100M strings of length varying from 1 to 100M characters). Could you say more about that, performance often depend on taking advantage of the characteristics of the input. – AProgrammer Aug 28 '16 at 11:45
  • 1
    @πάνταῥεῖ - no, it's not. His Q and comments indicate he's loading the entire file in memory few time a month - nothing about sequential file IO (perhaps except a `void loadfile(string filename);` - hardly a matter of optimisation). – Adrian Colomitchi Aug 28 '16 at 11:50
  • @AdrianColomitchi If I would have been sure, I'd hammered the question. – πάντα ῥεῖ Aug 28 '16 at 11:51
  • An example of a misleading question as you do not necessarily want to read a file at all. Instead it seems you are in need of a look up table https://en.wikipedia.org/wiki/Lookup_table – user2672165 Aug 28 '16 at 12:38
  • @user2672165 There can be approaches, like the command line invocation, or the memory mapped files, which might work too. I just focused on lookup-tables, because I guess it is the fastest way. – Daniel Marschall Aug 28 '16 at 13:24
  • Possible duplicate of [How can I build a lookup table in C++?](http://stackoverflow.com/questions/3819030/how-can-i-build-a-lookup-table-in-c) – user2672165 Aug 28 '16 at 16:09

2 Answers2

1

Store all the lines from the file in a std::unordered_set.

#include <iostream>
#include <unordered_set>
#include <string>

int main(int argc, char **argv)
{
    std::unordered_set<std::string> lines;
    lines.insert("line 1");
    lines.insert("line 2");

    std::string needle = argv[1];
    if (lines.find(needle) != lines.end())
        std::cout << "found\n";
    else
        std::cout << "NOT found\n";

    return 0;
}
Andriy Tylychko
  • 15,967
  • 6
  • 64
  • 112
Roland Illig
  • 40,703
  • 10
  • 88
  • 121
0

First of all load the file into the memory. I'm guessing you have enough.

Then I would try a linear search within the memory. If you start looking for the first character stop there and look for the consecutive characters you are looking for. If they consecutive characters do not match continue searching with the first character and so on.

Does the file has to have a pattern or be sorted at certain conditions. If that's the case you might have chances to optimizes even further.

Also try to use string references like this:

void loadfile(const string &filename);
bool stringAvailable(const string &str);

It might avoid unnecessary copies.

Gerhard Stein
  • 1,543
  • 13
  • 25