So I'm looking for a way to efficiently search for text in a file. Right now I'm using this:

```csharp
using (FileStream fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read, 1024 * 1024, FileOptions.SequentialScan))
using (StreamReader streamReader = new StreamReader(fileStream))
{
    string line;
    while ((line = streamReader.ReadLine()) != null)
    {
        int index = 0;
        while ((index = line.IndexOf(searchText, index, StringComparison.Ordinal)) != -1)
        {
            index += searchText.Length;
        }
    }
}
```

However, I was wondering if there was a way to more efficiently search the file. I was thinking of maybe searching for the text in buffers, but I'm not sure how. Thanks.

EDIT: Without calling IndexOf, I get around 1600ms. With IndexOf, it's around 7400ms.

EDIT: I have a basic implementation of chunk reading, and it got the time down to 740ms (no reading of lines). It still needs a lot of work, but I basically read a chunk at a time and call IndexOf on it.
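For reference, a minimal sketch of the chunked approach described in the edit above (the buffer size, method name, and overlap handling are illustrative assumptions, not the actual code). The key detail is keeping the last `searchText.Length - 1` characters of each chunk so a match straddling a chunk boundary is not missed:

```csharp
using System;
using System.IO;

// Read the file in large blocks and run IndexOf over each block, carrying an
// overlap of (searchText.Length - 1) chars across block boundaries.
static int CountOccurrences(string filePath, string searchText, int bufferSize = 1 << 20)
{
    int count = 0;
    string carry = string.Empty; // tail of the previous chunk
    using var reader = new StreamReader(new FileStream(
        filePath, FileMode.Open, FileAccess.Read, FileShare.Read,
        bufferSize, FileOptions.SequentialScan));
    var buffer = new char[bufferSize];
    int read;
    while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        string chunk = carry + new string(buffer, 0, read);
        int index = 0;
        while ((index = chunk.IndexOf(searchText, index, StringComparison.Ordinal)) != -1)
        {
            count++;
            index += searchText.Length;
        }
        // The carry is shorter than searchText, so a match found in the next
        // chunk cannot have been fully counted in this one (no double counting).
        carry = chunk.Length >= searchText.Length
            ? chunk.Substring(chunk.Length - (searchText.Length - 1))
            : chunk;
    }
    return count;
}

// Tiny demo on a throwaway file.
File.WriteAllText("demo.txt", "abcXYZabcXYZ");
Console.WriteLine(CountOccurrences("demo.txt", "XYZ")); // prints 2
```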

  • It depends on what the bottleneck is; if it's the file read, you are not going to get much faster apart from adjusting the buffer size. – TheGeneral Aug 04 '20 at 05:47
  • 1
    Does [this answer](https://stackoverflow.com/questions/13959429/c-sharp-searching-large-text-file) help? – Hayden Aug 04 '20 at 05:47
  • @TheGeneral I increment index because I want to find all occurrences in that string. – VS-ux Aug 04 '20 at 05:49
  • @VS-ux yeah i figured that eventually. – TheGeneral Aug 04 '20 at 05:51
  • @Hayden IndexOf is already faster than the Boyer-Moore algorithm. I think it's because I'm calling IndexOf too many times. I probably have to do it on a large buffer. – VS-ux Aug 04 '20 at 05:52
  • You could also break this into parallel workloads probably if you are reading large buffers, however you will hit an IO bottle neck eventually – TheGeneral Aug 04 '20 at 05:52
  • I tried parallelizing it, but it seemed even slower. (Parallelizing File.ReadLines) – VS-ux Aug 04 '20 at 05:53
  • Do you care what line the search term is on? – TheGeneral Aug 04 '20 at 05:54
  • No, but possibly in the future – VS-ux Aug 04 '20 at 05:56
  • @VS-ux parallelising the file read won't help you. you might have 2,4,8,16 or 128 cores to calculate stuff - but only _one_ storage device from which to read. – Franz Gleichmann Aug 04 '20 at 05:56
  • @FranzGleichmann Well I think you could use MMF, but it might still be IO bound – VS-ux Aug 04 '20 at 05:57
  • 1
    Memory mapped files or not, your file reads will likely be slower than the actual searching (if i were to take a guess) – TheGeneral Aug 04 '20 at 05:58
  • @TheGeneral Without calling IndexOf, I get around 1600ms. With index of, it's around 7400ms. – VS-ux Aug 04 '20 at 05:59
  • The data can be whatever the user wants, and the search terms can be anything as well. Also, I tested your code and parallelizing doesn't really help in this scenario. – VS-ux Aug 04 '20 at 06:07
  • I repeat TheGeneral: even though your usage of `IndexOf` is a bit slower, you should first profile whether your file access is not even slower. Why should you optimize away some milliseconds on `IndexOf` when your actual bottleneck is far bigger? Don't hunt for every optimization, just make those that **matter**. – MakePeaceGreatAgain Aug 04 '20 at 06:16
  • @HimBromBeere Without calling IndexOf, I get around 1600ms. With IndexOf, it's around 7400ms. However, I think I shouldn't call IndexOf for every line. I should instead read buffers, and then extend to the next line if needed. – VS-ux Aug 04 '20 at 06:17
  • @TheGeneral I tested it with the standbylist (file cache) cleared. Also, my test file is about 400MB large but every line is only like 5 bytes large. That equates to many lines. – VS-ux Aug 04 '20 at 06:26
  • @TheGeneral I mean that instead of calling IndexOf for every line, I call it only every 1MB chunk. So if there are 10000 lines in 1MB, then I skip 9999 calls to IndexOf – VS-ux Aug 04 '20 at 06:32
  • 1
    @VS-ux, you are not going to get answer here, even if someone could come up with a good one its likely going to be faster or slower depending on the particular file read, and whether you want to do this by line (and other considerations). Also there are variables like your cpu speed and hdd speed which will play in to the equation. If i were you (and i am sure you are capable) just keep on bench-marking, and you will come to the best solution for you. (note, i delete most of the comments as this was getting out of hand) – TheGeneral Aug 04 '20 at 06:36
  • Alright. Thank you TheGeneral. I will keep trying. – VS-ux Aug 04 '20 at 06:37

1 Answer


From a performance point of view, your approach is O(n·m) in the worst case, where n is the length of the text being searched and m is the length of the pattern you are trying to find. There are a few general algorithms you can apply:

  • Boyer-Moore
  • Morris-Pratt
  • Knuth-Morris-Pratt

I recommend Boyer-Moore; here is an example of how to implement it: https://www.geeksforgeeks.org/boyer-moore-algorithm-for-pattern-searching/
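For concreteness, here is a sketch of the bad-character variant of Boyer-Moore in C#. It is a simplified version (the full algorithm in the link above also uses the good-suffix rule, and a faster shift after a match), so treat it as illustrative rather than optimal:

```csharp
using System;
using System.Collections.Generic;

// Boyer-Moore search using only the bad-character heuristic.
// Returns the starting indices of all (possibly overlapping) matches.
static List<int> BoyerMooreSearch(string text, string pattern)
{
    var matches = new List<int>();
    int n = text.Length, m = pattern.Length;
    if (m == 0 || m > n) return matches;

    // Last occurrence of each character within the pattern.
    var last = new Dictionary<char, int>();
    for (int i = 0; i < m; i++) last[pattern[i]] = i;

    int s = 0; // current shift of the pattern over the text
    while (s <= n - m)
    {
        // Compare right-to-left.
        int j = m - 1;
        while (j >= 0 && pattern[j] == text[s + j]) j--;

        if (j < 0)
        {
            matches.Add(s);
            s += 1; // conservative shift; keeps overlapping matches
        }
        else
        {
            // Align the mismatched text char with its last occurrence
            // in the pattern (or skip past it entirely if absent).
            int lo = last.TryGetValue(text[s + j], out int idx) ? idx : -1;
            s += Math.Max(1, j - lo);
        }
    }
    return matches;
}

Console.WriteLine(string.Join(",", BoyerMooreSearch("ABAAABCD", "ABC"))); // prints 4
```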

  • I saw in many tests that IndexOf was still actually faster than Boyer-Moore – VS-ux Aug 04 '20 at 06:02
  • The current implementation of String.IndexOf(string) uses a naive loop to check whether a substring of some given text matches a pattern. Although this implementation is simple and easy to understand, it is very inefficient because it has to iterate through a minimum of m - n + 1 characters of the text, where m is the length of the text and n is the length of the pattern string. – D A Aug 04 '20 at 06:06
  • However, in many tests like this one: http://www.blackbeltcoder.com/Articles/algorithms/fast-text-search-with-boyer-moore IndexOf outperformed Boyer-Moore – VS-ux Aug 04 '20 at 06:10
  • I tested Boyer-Moore against larger chunks (in favor of Boyer-Moore); however, it still lost by almost half. – VS-ux Aug 04 '20 at 06:15
  • Boyer-Moore is demonstrated to be one of the fastest string-search algorithms in practice. https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm. I don't know what implementation you're using, or in what context, but from a theoretical point of view Boyer-Moore stands as the best approach. – D A Aug 04 '20 at 06:20
  • You're correct. However, for smaller chunks of data (I don't want to read entire file into memory), like even 1MB large, IndexOf is almost 2x faster (With StringComparison.Ordinal) – VS-ux Aug 04 '20 at 06:25
  • @VS-ux if you are running on Windows, `IndexOf` calls https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-findnlsstringex, I wouldn't be surprised to find that this is an implementation of the Boyer-Moore Algorithm with the Galil rule. – Jodrell Oct 04 '22 at 10:06