Search for string in large text file

Question

I'm trying to search for a substring in a large text file.

I found David Heffernan's buffered disk access unit.

So I use it like this:

function _FindStrInFile(const AFileName, SubStr: string): string;
var
  Stream   : TReadOnlyCachedFileStream;
  SL       : TStringList;
  I        : Integer;
begin
  Result   := '';
  Stream   := TReadOnlyCachedFileStream.Create(AFileName);
  try
    Stream.Position := 0;
    SL := TStringList.Create;
    try
      SL.LoadFromStream(Stream);
      for I := 0 to SL.Count-1 do begin
        if Pos(SubStr, SL[I]) > 0 then begin
          Result := SL[I];
          Break;
        end;
      end;
    finally
      SL.Free;
    end;
  finally
    Stream.Free;
  end;
end;

I'm not sure whether I'm using the buffered disk access correctly, or whether I make it useless when I load the stream into a TStringList and loop through it.

I ask because I timed the method above and the one below, in milliseconds, and the one below was faster for the file I tested.

function _FindStrInFile(const AFileName, SubStr: string): string;
var
  SL       : TStringList;
  I        : Integer;
begin
  Result   := '';
  SL := TStringList.Create;
  try
    SL.LoadFromFile(AFileName);
    for I := 0 to SL.Count-1 do begin
      if Pos(SubStr, SL[I]) > 0 then begin
        Result := SL[I];
        Break;
      end;
    end;
  finally
    SL.Free;
  end;
end;

Any suggestions or guidance to improve the first function?

What if the searched text is split between the currently loaded chunk and the next one? I mean, try to search for the text "repeat" when this kind of streaming loads the chunks in two parts like "rep" and "eat"; what to do then? My suggestion here is, if the file cannot be parsed, do not search for text by reading chunks while streaming. — Victoria, May 24 '17 at 02:39
You throw away the advantage of the buffered stream the second you load it into the stringlist. You're no longer using that buffered stream at all; TStringList.LoadFromStream loads the entire stream content into memory. Get rid of TStringList entirely, and search the stream itself. — Ken White, May 24 '17 at 02:50
True Ken, but as it seems the text file is being streamed here. Yeah, the solution is to "remember" what has already been read and what follows in the stream, but it depends on the length of the text to be searched. — Victoria, May 24 '17 at 03:19
@VictoriaMarotoSilva a solution would be to use `TStreamReader` instead. It takes a `TStream` as input, so you can still use buffered file I/O, and it has a `ReadLine()` method, so you can still check individual lines without worrying about buffer chunking. — Remy Lebeau, May 24 '17 at 06:52
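
The "remember what has already been read" idea from the comments above can be sketched roughly as follows. This is only an illustration, not code from David Heffernan's unit: `FindBytesInStream` and its `ChunkSize` parameter are hypothetical names, and the sketch assumes the file and the search text use the same byte encoding (UTF-8 here), so it compares raw bytes. It carries the last `Length(Pattern) - 1` bytes of each chunk over to the next read, so a match split like "rep" + "eat" is still found.

```pascal
program ChunkSearchDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

// Hypothetical helper: returns the stream offset of the first occurrence
// of Pattern, or -1 if it is not found. Reads fixed-size chunks and keeps
// an overlap of Length(Pattern) - 1 bytes between consecutive chunks.
function FindBytesInStream(Stream: TStream; const Pattern: TBytes;
  ChunkSize: Integer = 64 * 1024): Int64;
var
  Buf: TBytes;
  Carry, BytesRead, Avail, I, J, PatLen: Integer;
  BasePos: Int64; // stream offset that corresponds to Buf[0]
begin
  Result := -1;
  PatLen := Length(Pattern);
  if (PatLen = 0) or (ChunkSize <= 0) then
    Exit;
  SetLength(Buf, ChunkSize + PatLen - 1);
  Carry := 0;
  BasePos := Stream.Position;
  repeat
    // Append the next chunk after the bytes carried over from the last one.
    BytesRead := Stream.Read(Buf[Carry], ChunkSize);
    Avail := Carry + BytesRead;
    // Naive byte comparison at every candidate start position.
    for I := 0 to Avail - PatLen do
    begin
      J := 0;
      while (J < PatLen) and (Buf[I + J] = Pattern[J]) do
        Inc(J);
      if J = PatLen then
        Exit(BasePos + I);
    end;
    if Avail >= PatLen then
    begin
      // Keep only the tail that could still start a match.
      Carry := PatLen - 1;
      if Carry > 0 then
        Move(Buf[Avail - Carry], Buf[0], Carry);
      Inc(BasePos, Avail - Carry);
    end
    else
      Carry := Avail; // too little data so far, keep all of it
  until BytesRead = 0;
end;

var
  FS: TFileStream;
begin
  // Usage: ChunkSearchDemo <file> <text>
  if ParamCount < 2 then
    Exit;
  FS := TFileStream.Create(ParamStr(1), fmOpenRead or fmShareDenyWrite);
  try
    Writeln(FindBytesInStream(FS, TEncoding.UTF8.GetBytes(ParamStr(2))));
  finally
    FS.Free;
  end;
end.
```

Because the helper only needs a `TStream`, a buffered stream class could presumably be passed in place of `TFileStream`. Note that it returns a byte offset rather than the matching line, so it is not a drop-in replacement for the function in the question.
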
TStreamReader has its own performance problems and personally I wrote my own replacement to deal with those. — David Heffernan, May 24 '17 at 07:14
@Victoria: No, it's not. Read the VCL Classes.pas source for `TStrings.LoadFromStream`. It immediately reads the entire stream into memory. Nothing is buffered by the stream that opened the file in the first place. As I said, any benefit that might have been gained by using a buffered stream is immediately discarded when it's read into the stringlist. — Ken White, May 24 '17 at 12:31
@VictoriaMarotoSilva well, then `TStringList` would have had even more severe memory issues to begin with. But why do you have a large text file without any line breaks in it? What kind of file is it? — Remy Lebeau, May 24 '17 at 16:08
@Remy, tax return form maybe :) Just a theory, practically cannot think about any. — Victoria, May 24 '17 at 16:27
@VictoriaMarotoSilva tax return forms are not plain text files, they are usually PDFs instead. You shouldn't use string processing functions on PDFs, use an actual PDF API instead. — Remy Lebeau, May 24 '17 at 16:32
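
For reference, the `TStreamReader` approach suggested in the comments might look something like the sketch below. It is a minimal illustration under a few assumptions: the file is line-oriented text that `TStreamReader` can decode with its default settings, `FindLineInStream` is a hypothetical helper name, and a plain `TFileStream` stands in for the buffered stream class from the question. It does not address the `TStreamReader` performance concerns raised above.

```pascal
program FindLineDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

// Hypothetical helper: returns the first line that contains SubStr,
// or an empty string if no line matches. Only one line is held in
// memory at a time, unlike TStringList.LoadFromStream.
function FindLineInStream(Stream: TStream; const SubStr: string): string;
var
  Reader: TStreamReader;
  Line: string;
begin
  Result := '';
  // The reader does not take ownership of Stream by default.
  Reader := TStreamReader.Create(Stream);
  try
    while not Reader.EndOfStream do
    begin
      Line := Reader.ReadLine;
      if Pos(SubStr, Line) > 0 then
        Exit(Line);
    end;
  finally
    Reader.Free;
  end;
end;

// Same signature as the function in the question.
function FindStrInFile(const AFileName, SubStr: string): string;
var
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  try
    Result := FindLineInStream(Stream, SubStr);
  finally
    Stream.Free;
  end;
end;

begin
  // Usage: FindLineDemo <file> <text>
  if ParamCount >= 2 then
    Writeln(FindStrInFile(ParamStr(1), ParamStr(2)));
end.
```

Since `FindLineInStream` accepts any `TStream`, the buffered stream from the question could presumably be created in `FindStrInFile` instead of `TFileStream`. Whether that beats `TStringList.LoadFromFile` for a given file would need to be measured.
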
