3

I have come up with this function to return the number of occurrences of a string in a Delphi Stream. However, I suspect there is a more efficient way to achieve this, since I am using "higher level" constructs (char), and not working at the lower byte/pointer level (which I am not that familiar with)

function ReadStream(const S: AnsiString; Stream: TMemoryStream): Integer;
var
  Arr: Array of AnsiChar;
  Buf: AnsiChar;
  ReadCount: Integer;

  procedure AddChar(const C: AnsiChar);
  var
    I: Integer;
  begin
    for I := 1 to Length(S) - 1 do
      Arr[I] := Arr[I+1];
    Arr[Length(S)] := C;
  end;

  function IsEqual: Boolean;
  var
    I: Integer;
  begin
    Result := True;
    for I := 1 to Length(S) do
      if S[I] <> Arr[I] then
      begin
        Result := False;
        Break;;
      end;
  end;

begin
  Stream.Position := 0;
  SetLength(Arr, Length(S));
  Result := 0;
  repeat
    ReadCount := Stream.Read(Buf, 1);
    AddChar(Buf);
    if IsEqual then
      Inc(Result);
  until ReadCount = 0;
end;

Can someone supply a procedure that is more efficient?

RaelB
  • 3,301
  • 5
  • 35
  • 55
  • 3
    You'd be hard pressed to make it less efficient. Reading bytes one at a time? Bad. Copying byte by byte. Bad. Compering byte by byte. Bad. Read into a large buffer. Don't limit usage to memory stream, or if you do take advantage and work with the memory directly. Use Move and CompareMem for their optised implementations. Don't copy so much. Rather than shifting Arr down, compare starting from the next byte. Be aware that your function is not locale sensitive, that might be fine. – David Heffernan May 27 '17 at 19:48
  • Essentially I am asking how to find a string in a stream. So probably the best would be to compare the string memory with the stream memory and not use a buffer at all. Is a buffer necessary? I do not know how to compare the string memory with stream memory. I wanted to make a direct comparison between Arr and S (if Arr = S then Inc(Result)), however, that would not compile. So how could I compare Arr and S if not using byte by byte? – RaelB May 27 '17 at 21:51
  • CompareMem like I said. But you aren't working with a stream. That is generic. You are working with a memory stream. Specific. If you work with memory stream then sure, compare with memory directly. – David Heffernan May 27 '17 at 21:55
  • You also have some significant bugs. 1) When `Stream.Read(Buf, 1);` returns "`0` bytes read" you still process `Buf` one last time. This can in the right conditions lead to over-counting the number of matches. 2) You don't initialise the contents of `Arr`. So again, in the right conditions you could count matches that include uninitialised data. – Disillusioned May 28 '17 at 03:09
  • Since the purpose of your function is to "(count) the number of occurrences of a string" why do you name it `ReadStream`? This is only going to make client code calling this function extremely difficult to read. – Disillusioned May 28 '17 at 03:16
  • I'm voting to close this question as off-topic because performance related questions without clear and verifiable current benchmarks vs. desired performance goals cannot be definitively answered. – Disillusioned May 28 '17 at 03:19
  • @MartynA You need to make sure that the buffer is null terminated – David Heffernan May 28 '17 at 09:39
  • I think it's a reasonable question. Obviously the is no such thing as the fastest code (`TANSTATFC`), but asking how to improve running time of code seems perfectly on-topic to me and OP has shown enough effort. So +1 from me. – Johan May 28 '17 at 11:11

2 Answers2

4

Stream has a method that will let you get into the internal buffer.

You can get a pointer to the internal buffer using the Memory property.

If you are working in 32 bit and you are willing to let go of the deprecated TMemoryStream and use TBytesStream instead you can use abuse the fact that a dynamic array and an AnsiString share the same structure in 32 bit.
Unfortunately Emba broke that compatibility in X64, Which means that for no good reason whatsoever you cannot have strings > 2GB in X64.

Note that this trick will break in 64 bit! (See fix below)

You can use Boyer-Moore string searching.

This allows you to write code like this:

function CountOccurrances(const Needle: AnsiString; const Haystack: TBytesStream): integer;
var
  Start: cardinal;
  Count: integer;
begin 
  Start:= 1;
  Count:= 0;
  repeat
    {$ifdef CPUx86}
    Start:= _FindStringBoyerAnsiString(string(HayStack.Memory), Needle, false, Start);
    {$else}
    Start:= __FindStringBoyerAnsiStringIn64BitTArrayByte(TArray<Byte>(HaySAtack.Memory), Needle, false, Start);
    {$endif}
    if Start >= 1 then begin
      Inc(Start, Length(Needle));
      Inc(Count);
    end;
  until Start <= 0;
  Result:= Count;
end;

For 32 bit you'll have to rewrite the BoyerMoore code to use AnsiString; a trivial rewrite.
For 64 bit you'll have to rewrite the BoyerMoore code to use a TArray<byte> as a first parameter; a relatively simple task.

If you are looking for efficiency, please try and avoid WinAPI calls that use pchars. c-style strings are a horrible idea, because they do not have a length prefix.

Johan
  • 74,508
  • 24
  • 191
  • 319
  • I tested and it is very fast. However, I would like this to work for D7, so I need it to work with TMemoryStream. (In your code the typecast should be to `AnsiString` - `_FindStringBoyerAnsiString(AnsiString(HayStack.Memory)`.) – RaelB May 28 '17 at 23:35
2

Johan has given you a good answer about Boyer-Moore searching. BM is fine if your are content to use it as a "black box", but if you want to understand what's going on, there is a bit of a gulf between the complexity of your own code and a BM implementation.

You might find it helpful to explore searching that's more efficient than your own code but not so complex as BM. There is one ultra-simple way to do what you want without getting invoved with pointers, PChars, etc.

Let's leave aside for a moment the fact that you want to work with a TMemoryStream, and consider finding the number of occurrences of a string SubStr in another string Target.

For efficiency, things you want to avoid are a) repeatedly scanning the same characters over and over and b) copying one or both strings.

Since D7, Delphi has included a PosEx function:

function PosEx(const SubStr, S: string; Offset: Cardinal = 1): Integer; Description PosEx returns the index of SubStr in S, beginning the search at Offset. If Offset is 1 (default), PosEx is equivalent to Pos. PosEx returns 0 if SubStr is not found, if Offset is greater than the length of S, or if Offset is less than 1.

So what you can do is repeatedly call PosEx, starting with Offset = 1, and each time it finds SubStr in Target you increment Offset to skip over it, like this (in a console application):

function ContainsCount(const SubStr, Target : String) : Integer;
var
  i : Integer;
begin
  Result := 0;
  i := 1;
  repeat
    i := PosEx(SubStr, Target, i);
    if i > 0 then begin
      Inc(Result);
      i := i + Length(SubStr);
    end;
  until i <= 0;
end;

var
  Count : Integer;
  Target : String;
begin
  Target := 'aa b ca';
  Count := ContainsCount('a', Target);
  writeln(Count);
  readln;
end.

The fact that PosEx and ContainsCount both pass SubStr and Target as consts meants that no string copying is involved, and it should be obvious that ContainsCount never scans the same characters more that once.

Once you've satisfied yourself that this works, you might care to trace into PosEx to see how it does its stuff.

You can do something which works in a similar way on PChars using the RTL functions StrPos/AnsiStrPos

To convert your memory stream to a string, you could use this code from Rob Kennedy's answer to this q Converting TMemoryStream to 'String' in Delphi 2009

function MemoryStreamToString(M: TMemoryStream): string;
begin
  SetString(Result, PChar(M.Memory), M.Size div SizeOf(Char));
end;

(Note what he says about the alternative version later in his answer)

Btw, if you look through the VCL + RTL code, you'll see that quite a lot of the string-parsing and processing code (e.g. in TParser, TStringList, TExpressionParser) all does its work with PChars. There's a reason for that of course, because it minimizes character copying and means that most scanning operations can be done by changing pointer (PChar) values.

MartynA
  • 30,454
  • 4
  • 32
  • 73
  • Thanks. Your main algorithm is the same as Johan's, except that you use a different inner function. So using your tip about `MemoryStreamToString` I can also use the BM function. Both are very fast. The BM is faster though – RaelB May 29 '17 at 20:44