1

I have a huge files that I can't afford loading into memory and an byte sequence I need to find inside.

This is what I use now:

public static byte[] GetRangeFromStream(ref FileStream fs, long start_index, long count)
{
    byte[] data = new byte[count];
    long prev_pos = fs.Position;
    fs.Position = start_index;
    fs.Read(data, 0, data.Length);
    fs.Position = prev_pos;
    return data;
}

public static long GetSequenceIndexInFileStream(byte[] seq, ref FileStream fs, long index, bool get_beginning = true)
{
    if (index >= fs.Length)
        return -1;

    fs.Position = index;

    for (long i = index; i < fs.Length; i++)
    {
        byte temp_byte = (byte)fs.ReadByte();

        if (temp_byte == seq[0] && IsArraysSame(seq, GetRangeFromStream(ref fs, i, seq.Length))) //compare just first bytes and then compare seqs if needed
            return i;
    }

    return -1;
}
Kosmo零
  • 4,001
  • 9
  • 45
  • 88
  • 1
    Does you sequence fits in memory? If yes then read file in chunks and search your sequence in each chunk. Remember about handling case when sequence is split across two chunks. – janisz Oct 08 '16 at 13:21
  • No idea who has downvoted you or why - its a perfectly good technical question. – PhillipH Oct 08 '16 at 13:24
  • @janisz - Yes, it fits, but that way I will always compare whole `seq` and `chunk`, instead of comparing first byte and only then comparing seqs. Or maybe I understand you wrongly? – Kosmo零 Oct 08 '16 at 13:24
  • I would say there's probably not a slower method than the current one. It is in many ways faulted, but Code Review site would be more suitable place for reviewing it. In any case, janisz already gave an idea for the regular way of finding. – Sami Kuhmonen Oct 08 '16 at 13:24
  • How about using [KMP algorithm](https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm) with small change in while condition where missing parts of sequence will be loaded and start bytes diskarded? – janisz Oct 08 '16 at 13:34
  • @MickyD - this looks like an interesting thing. I will look what it can give me. – Kosmo零 Oct 08 '16 at 13:48
  • No problem. I accidentally deleted my comment. Anyway check out my answer below –  Oct 08 '16 at 14:07

3 Answers3

3

An optimal way of doing this is to read the file stream byte-by-byte only looking for the first byte of the string you are searching for. When you get a match, you know you have a possible hit. Read the next N bytes (where N is the length of the string being searched for) and do an array comparison on the contents of the bytes read from the file, and the bytes being searched for.

Let .net deal for the file read buffer by setting an appropriate FileStream buffer when you open it. Dont worry about reading forwards to create your own buffer - its a waste of time - .net does it fine.

This approach means you are not array comparing every byte and you don't care about splits across buffers etc.

If the file is really large and you are not I/O Bound then you could consider creating multiple Tasks usign the .net Task Library each one starting at a different location in the file stream, and correllating the results of each Task once all are completed.

PhillipH
  • 6,182
  • 1
  • 15
  • 25
  • I tested and don't understand anything. After I started to create FileStreams that way: `FileStream pf = new FileStream(tpls[i], FileMode.Open, FileAccess.Read, FileShare.Read, 256 * 1024 * 1024);` the search became dramatically slower O_o – Kosmo零 Oct 08 '16 at 13:45
  • 1
    It's not optimal solution. Comparing when first byte match will cause many unnecessary checks but basic idea is valid only need to incorporate observation that "searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin" – janisz Oct 08 '16 at 15:50
  • To avoid reading the entire byte stream into memory, you need to read it in buffered chunks. .net has a perfectly good buffered file stream which handles that. The checking of each byte read against the first byte in your buffer is neccessary whichever approach you take. You can skip the last N bytes where N is search term minus 1, but thats all you can do. If you want to build an index to the file content, then you can do - it will speed subsequent searches, but thats not whats being asked. – PhillipH Oct 08 '16 at 15:59
1

I propose to use modified version of Knuth Moriss Pratt algorithm. algorithm kmp_search: input: a stream of characters, S (the text to be searched) an array of characters, W (the word sought) output: an integer (the zero-based position in S at which W is found)

define variables:
    an integer, m ← 0 (the beginning of the current match in S)
    an integer, i ← 0 (the position of the current character in W)
    an array of integers, T (the table, computed elsewhere)

while m + i < length(S) do
    if W[i] = S[m + i] then
        if i = length(W) - 1 then
            return m
        let i ← i + 1
    else
        if T[i] > -1 then
            let m ← m + i - T[i], i ← T[i]
        else
            let m ← m + 1, i ← 0

(if we reach here, we have searched all of S unsuccessfully)
return the length of S

The text string can be streamed in because the KMP algorithm does not backtrack in the text. (This is another improvement over the naive algorithm, which doesn’t naturally support streaming.) If streaming, the amortized time to process an incoming character is Ɵ(1) but the worst-case time is Ɵ(min(m, n′)), where n′ is the number of text characters seen so far. Source

Referecne (Java) implementation could be found here

package com.twitter.elephantbird.util;

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/**
 * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
 * For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
 */
public class StreamSearcher {

  protected byte[] pattern_;
  protected int[] borders_;

  // An upper bound on pattern length for searching. Results are undefined for longer patterns.
  public static final int MAX_PATTERN_LENGTH = 1024;

  public StreamSearcher(byte[] pattern) {
    setPattern(pattern);
  }

  /**
   * Sets a new pattern for this StreamSearcher to use.
   * @param pattern
   *          the pattern the StreamSearcher will look for in future calls to search(...)
   */
  public void setPattern(byte[] pattern) {
    pattern_ = Arrays.copyOf(pattern, pattern.length);
    borders_ = new int[pattern_.length + 1];
    preProcess();
  }

  /**
   * Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
   * that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
   * byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
   * another reasonable default, i.e. leave the stream unchanged.
   *
   * @return bytes consumed if found, -1 otherwise.
   * @throws IOException
   */
  public long search(InputStream stream) throws IOException {
    long bytesRead = 0;

    int b;
    int j = 0;

    while ((b = stream.read()) != -1) {
      bytesRead++;

      while (j >= 0 && (byte)b != pattern_[j]) {
        j = borders_[j];
      }
      // Move to the next character in the pattern.
      ++j;

      // If we've matched up to the full pattern length, we found it.  Return,
      // which will automatically save our position in the InputStream at the point immediately
      // following the pattern match.
      if (j == pattern_.length) {
        return bytesRead;
      }
    }

    // No dice, Note that the stream is now completely consumed.
    return -1;
  }

  /**
   * Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
   * and aids in implementation of the Knuth-Moore-Pratt string search.
   * <p>
   * For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
   */
  protected void preProcess() {
    int i = 0;
    int j = -1;
    borders_[i] = j;
    while (i < pattern_.length) {
      while (j >= 0 && pattern_[i] != pattern_[j]) {
        j = borders_[j];
      }
      borders_[++i] = ++j;
    }
  }
}

Similar question: Efficient way to search a stream for a string

Community
  • 1
  • 1
janisz
  • 6,292
  • 4
  • 37
  • 70
1

You might want to check out file mapping. It allows you to essentially treat large files as if they were memory buffers thus allowing to use any memory-based API on a disk file. No mucking about with explicit file I/O.

MSDN:

File mapping is the association of a file's contents with a portion of the virtual address space of a process. The system creates a file mapping object (also known as a section object) to maintain this association. A file view is the portion of virtual address space that a process uses to access the file's contents. File mapping allows the process to use both random input and output (I/O) and sequential I/O. It also allows the process to work efficiently with a large data file, such as a database, without having to map the whole file into memory. Multiple processes can also use memory-mapped files to share data....tell me more...