-6

Does Delphi (10.4) have a string-tokenizer that extracts string-token-objects from a string in a similar way as below?

MyPhrase := 'I have a simple word and a complex Word: A lot of WORDS.';

MyTokens := MyTokenize(MyPhrase, 'word');

for i := 0 to MyTokens.Count - 1 do
  Memo1.Lines.Add(IntToStr(MyTokens[i].Pos) + ': ' + MyTokens[i].String);

Gives this result in Memo1:

16: word  
35: Word  
50: WORD

Searching for "tokenize string" in the Delphi documentation did not get any useful results for this purpose.

Of course, writing such a function is trivial, but I wonder if there already is a procedure for this in the existing huge Delphi code treasure.

EDIT: I am experimenting with a wordlist that should have the required features:

program MyTokenize;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  CodeSiteLogging,
  System.RegularExpressions,
  System.Types,
  System.Classes,
  System.StrUtils,
  System.SysUtils;

type
  PWordRec = ^TWordRec;

  TWordRec = record
    WordStr: string;
    WordPos: Integer;
  end;

  TWordList = class(TList)
  private
    function Get(Index: Integer): PWordRec;
  public
    destructor Destroy; override;
    function Add(Value: PWordRec): Integer;
    property Items[Index: Integer]: PWordRec read Get; default;
  end;

function TWordList.Add(Value: PWordRec): Integer;
begin
  Result := inherited Add(Value);
end;

destructor TWordList.Destroy;
var
  i: Integer;
begin
  for i := 0 to Count - 1 do
    FreeMem(Items[i]);
  inherited;
end;

function TWordList.Get(Index: Integer): PWordRec;
begin
  Result := PWordRec(inherited Get(Index));
end;

var
  WordList: TWordList;
  WordRec: PWordRec;
  i: Integer;

begin
  try
    //MyPhrase := 'A crossword contains words but not WORD';

    WordList := TWordList.Create;
    try
      // AV only at the THIRD loop!!!
      for i := 0 to 2 do
      begin
        GetMem(WordRec, SizeOf(TWordRec));
        WordRec.WordPos := i;
        WordRec.WordStr := IntToStr(i);
        WordList.Add(WordRec);
      end;

      for i := 0 to WordList.Count - 1 do
        Writeln('WordPos: ', WordList[i].WordPos, ' WordStr: ', WordList[i].WordStr);

      WriteLn('  Press Enter to free the list');
      ReadLn;
    finally
      WordList.Free;
    end;

  except
    on E: Exception do
    begin
      Writeln(E.ClassName, ': ', E.Message);
      ReadLn;
    end;
  end;
end.

Unfortunately, it has a strange bug: It gets an AV exactly at the THIRD for loop!

EDIT2: It seems that the AV happens only when the project's Build Configuration is set to Debug. When it is set to Release, there is no AV. Does this have something to do with the memory manager?

user1580348
  • (I didn't downvote.) The Delphi RTL doesn't contain such a function, but it is trivial to write such a function given a precise specification of its behaviour. The downvoter might have been annoyed by the lack of such a specification in your Q. For instance, it isn't clear if `I have a crossword.` should yield (14, word) or (9, crossword) or nothing. – Andreas Rejbrand Feb 19 '21 at 18:04
  • Or maybe the downvoter considered this to be a "please write the code for me" kind of Q. – Andreas Rejbrand Feb 19 '21 at 18:08
  • @AndreasRejbrand Your `I have a crossword.` question is not logical. From my example, it is clear that it should yield (14, word). – user1580348 Feb 19 '21 at 18:15
  • The ""please write the code for me" kind of Q" statement offends me: If such a solution already exists, it is perfectly legitimate to ask for the solution. – user1580348 Feb 19 '21 at 18:18
  • 3
    @user1580348: Please notice that I *didn't* downvote. I was just guessing why someone else did, based on my knowledge of the SO community. No need to kill the messenger. – Andreas Rejbrand Feb 19 '21 at 18:20
  • Yes, it is trivial to write such a function. But I have asked whether such a solution already exists. So why invent the wheel twice? – user1580348 Feb 19 '21 at 18:21
  • 1
    I agree, it is a valid question. Just saying that interpretation might not be obvious to everyone. But with your comment here, it does become obvious. You might want to include the phrase "Of course, writing such a function is trivial, but I wonder if there already is a standard facility for this in the Delphi RTL" or something similar in the Q next time. Regarding the "crosswords", I might warn you that mathematically-inclined people don't consider examples to be a substitute for a precise specification. – Andreas Rejbrand Feb 19 '21 at 18:24
  • @AndreasRejbrand Thanks for the warning. But many times an example is more worth than a "specification". I am empirical-inclined. – user1580348 Feb 19 '21 at 18:31
  • 1
    *"From my example, it is clear that it should yield (14, word)"* I understood the opposite. – Olivier Feb 19 '21 at 18:34
  • The longer these comments go on, the more I am convinced this q deserves to be closed, because as written it is unanswerable. – MartynA Feb 19 '21 at 18:36
  • 1
    *"But many times an example is more worth than a "specification""* A question like this should provide both a specification and some examples. – Olivier Feb 19 '21 at 18:36
  • @Olivier: I agree, and they should be unambiguous examples. At the moment, it seems to be all in the eye of the OP. – MartynA Feb 19 '21 at 18:39
  • 3
    Under the usual rules, tokenization will consider "crossword" a single token, so it can't match "word". Using the term "token" is incorrect here. It seems that all you want to do is find occurrences of a substring inside a string (in a case-insensitive way). – Olivier Feb 19 '21 at 18:56
  • @user1580348 "_I am empirical-inclined_" - yes, I [saw that already](https://stackoverflow.com/q/66103102/4299358) and the downside is not being confident in basics. That's why you misuse terms like "token" and "object". What you want seems to be a substring only, but why you don't want to name it as such and didn't specify it right away so far remains in your head. And we're on the outside. [Even the dictionary](https://en.wiktionary.org/wiki/token) describes token as "word", and words are never enclosed in other letters. – AmigoJack Feb 19 '21 at 19:17
  • @AndreasRejbrand Thanks for your constructive comments. I have added some code to the question that is my approach to a solution. Unfortunately, it gets a strange AV only in the third for loop! – user1580348 Feb 19 '21 at 22:03
  • Which part of the code you have added is supposed to extract the i'th token and how? – MartynA Feb 19 '21 at 22:31
  • @MartynA Please read the text in the question: It is an **EXPERIMENTAL** code. As soon as the AV problem is solved I will insert the code to extract the word records. Be patient! – user1580348 Feb 19 '21 at 23:32
  • @user1580348: I'm afraid your memory management is wrong. Since `TWordRec` is a record with a managed type (specifically, the string), you cannot allocate it using `GetMem`. If your application doesn't crash, that's just "luck". You need to use `New` instead. But there really isn't any need to use such ancient techniques at all. Instead, use a `TList<TWordRec>` from `Generics.Collections` (and remove `PWordRec`). You should only use low-level techniques if you have a solid understanding of the [internals](http://docwiki.embarcadero.com/RADStudio/en/Internal_Data_Formats_(Delphi)#Long_String_Types) – Andreas Rejbrand Feb 19 '21 at 23:48
  • 2
    What's happening in your code is that, after `GetMem`, `WordRec` points to a newly-allocated `SizeOf(TWordRec)`-sized region in memory. Since `GetMem` doesn't fill this block with zeros, `WordRec.WordStr` will be a random pointer. Hence, when you do `WordRec.WordStr := '...'`, the RTL will go to this random location in memory, believing it to be a string heap object, and reduce its "refcount" by 1. In other words, it will make a "random" change to a "random" place in memory. Then anything can happen. Hopefully an AV. – Andreas Rejbrand Feb 20 '21 at 00:21
  • @AndreasRejbrand To make a quick fix, I have replaced `GetMem` with `WordRec := AllocMem(SizeOf(TWordRec));`, since `AllocMem` fills the allocated memory with zeros and can also be freed with `FreeMem`. Is this correct? However, now there is no more AV when using `AllocMem`. – user1580348 Feb 20 '21 at 08:20
  • @AndreasRejbrand Following your advice I now use `WordRec := System.New(PWordRec);` and then to free it: `System.Dispose(Items[i]);`. Is that correct? – user1580348 Feb 20 '21 at 08:40
  • 1
    Why don't you follow the real advice, which is to use `TList<TWordRec>`? – Olivier Feb 20 '21 at 08:42
  • 1
    Sorry, I don't see why we readers should be patient. The behaviour you are now asking about is a completely different thing than the "Does Delphi have a string-tokenizer?" you originally asked. This q should be closed because you are just wasting readers' time. – MartynA Feb 20 '21 at 09:30
  • @user1580348: Yes, `New` and `Dispose` are the pair of functions you use to dynamically allocate records with managed members on the heap. However, there really isn't any need to use such low-level approaches here. A `TList<TWordRec>` is definitely the right thing to use here. I always use that approach these days. – Andreas Rejbrand Feb 20 '21 at 11:14
  • @AndreasRejbrand Thank you for your constructive input! Frankly, I am not that familiar with Generics and I wouldn't know how to use `TList` in the context of this solution. However, the solution presented here does work. – user1580348 Feb 20 '21 at 12:45
  • @user1580348: It is actually much easier to use generics. Just add `Generics.Collections` to the `uses` clause and then you can do `var L: TList<TWordRec>` and `L := TList<TWordRec>.Create; try L.Add(MyWordRec); ... for i := 0 to L.Count - 1 do ShowMessage(L[i].WordStr)` or even `for WordRec in L do`. It really couldn't be easier! If you are afraid of angular brackets, you can define `type TWordList = TList<TWordRec>` and just use this type. – Andreas Rejbrand Feb 20 '21 at 13:02
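
The generic-list approach suggested in the comments above can be sketched as a small console program. This is only a minimal sketch, assuming a Delphi version with `System.Generics.Collections` and unit scope names (XE2 or later); the program name `WordListSketch` is made up for illustration. It fills the list with the same three dummy records as the experimental code in the question, but without `PWordRec`, `GetMem` or a custom destructor.

```delphi
program WordListSketch; // hypothetical name, for illustration only

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Generics.Collections;

type
  TWordRec = record
    WordStr: string;
    WordPos: Integer;
  end;

var
  WordList: TList<TWordRec>;
  WordRec: TWordRec;
  i: Integer;

begin
  WordList := TList<TWordRec>.Create;
  try
    for i := 0 to 2 do
    begin
      // WordRec is an ordinary record variable, so the compiler takes care
      // of initializing and finalizing its managed string field.
      WordRec.WordPos := i;
      WordRec.WordStr := IntToStr(i);
      WordList.Add(WordRec); // the list stores its own copy of the record
    end;

    for i := 0 to WordList.Count - 1 do
      Writeln('WordPos: ', WordList[i].WordPos, ' WordStr: ', WordList[i].WordStr);
  finally
    // Freeing the list also releases the records it holds,
    // including their strings; no Dispose loop is needed.
    WordList.Free;
  end;
end.
```

Because the list holds the records by value, there is no pointer to allocate or free, which removes the class of error that caused the access violation in the question.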

3 Answers

3

On request, here is how I would do this myself:

Screenshot of application

First, I want to create a function that performs this operation, so it can be reused every time we need to do this.

I could have this function return or populate a TList<TPhraseMatch>, but then it would be tiresome to work with it, because the user of the function would then need to add try..finally blocks every time the function is used. Instead, I let it return a TArray<TPhraseMatch>. By definition, this is simply array of TPhraseMatch, that is, a dynamic array of TPhraseMatch records.

But how to efficiently populate such an array? We all know you shouldn't increase the length of a dynamic array one element at a time; besides, that requires a lot of code. Instead, I populate a local TList<TPhraseMatch> and then, as a last step, create an array from it:

type
  TPhraseMatch = record
    Position: Integer;
    Text: string;
  end;

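// Requires System.SysUtils (string helpers such as ToLower) and
// System.Generics.Collections (TList<T>) in the uses clause.
function GetPhraseMatches(const AText, APhrase: string): TArray<TPhraseMatch>;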
begin

  var TextLower := AText.ToLower;
  var PhraseLower := APhrase.ToLower;

  var List := TList<TPhraseMatch>.Create;
  try

    var p := 0;
    repeat
      p := Pos(PhraseLower, TextLower, p + 1);
      if p <> 0 then
      begin
        var Match: TPhraseMatch;
        Match.Position := p - 1 {since the OP wants 0-based string indexing};
        Match.Text := Copy(AText, p, APhrase.Length);
        List.Add(Match);
      end;
    until p = 0;

    Result := List.ToArray;

  finally
    List.Free;
  end;

end;

Notice that I chose an alternative to the regular expression approach, just for educational reasons. I also believe this approach is faster. Also notice how easy it is to work with the TList<TPhraseMatch>: it's just like a TStringList but with match records instead of strings!

Now, let's use this function:

procedure TWordFinderForm.ePhraseChange(Sender: TObject);
begin

  lbMatches.Items.BeginUpdate;
  try
    lbMatches.Items.Clear;
    for var Match in GetPhraseMatches(mText.Text, ePhrase.Text) do
      lbMatches.Items.Add(Match.Position.ToString + ':'#32 + Match.Text)
  finally
    lbMatches.Items.EndUpdate;
  end;

end;

Had I not chosen to use a function, but placed all code in one block, I could have iterated over the TList<TPhraseMatch> in exactly the same way:

for var Match in List do
Andreas Rejbrand
  • Depending on the precise specification, you may want to begin the next search at `p + APhrase.Length` instead. Try to search for `w` in `wwwwww` to see the difference. – Andreas Rejbrand Feb 20 '21 at 17:42
  • Very nice advanced code! I have rebuilt it as a Windows 32 VCL Application in Delphi 10.4.1. Which is the lowest Delphi version working with this code? – user1580348 Feb 21 '21 at 13:44
  • Thanks! The example app I made (the screenshot) is also a 32-bit VCL Windows application. (I never touch any other platform or FMX!) The code uses generics, which were introduced in Delphi 2009 if I recall correctly. It also uses record helpers (for instance: `MyString.Length` instead of `Length(MyString)` and `MyString.ToUpper` instead of `AnsiUpperCase(MyString)`). I don't recall when these were added (XE3?). Finally, as you can see, I use inline variable declarations, which were added in Delphi 10.3 I think. So you need 10.3, but with *very* minor changes, you can make it work in Delphi 2009. – Andreas Rejbrand Feb 21 '21 at 13:50
  • So when users have an older Delphi version they could use my solution below. – user1580348 Feb 21 '21 at 13:53
  • Everyone can use it, but if you are using Delphi 2009 and above, by far the easiest (and most robust) solution is to use `TList<T>` as I do (but I chose to rename the record because it can catch also parts of words). – Andreas Rejbrand Feb 21 '21 at 13:56
  • BTW, the use of **For Loops With Inline Loop Variable Declaration** is explained here: https://blog.marcocantu.com/blog/2018-october-inline-variables-delphi.html – user1580348 Feb 21 '21 at 14:00
  • 1
    Ah, I forgot: you use regular expressions. That's also a fairly new addition (XE?), so people using an older version need to use my Pos-based approach instead. – Andreas Rejbrand Feb 21 '21 at 14:00
  • It is good to be aware of how many innovations the Delphi language has seen in recent years. – user1580348 Feb 21 '21 at 14:05
  • Nice demo of inline variable declarations, +1. – MartynA Feb 21 '21 at 17:25
  • @MartynA: Thanks! I have become really fond of those. They add a new level of safety: it's very nice to know that a variable *cannot* be used before it is initialized and that its scope is as small as possible. I'm almost considering this to be best practice today. We'll see if the rest of the Delphi community will also come to this conclusion. – Andreas Rejbrand Feb 21 '21 at 19:21
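
The first comment above mentions resuming the search at `p + APhrase.Length` so that matches cannot overlap. A minimal sketch of that variant follows, written with classic variable declarations so it does not depend on inline variables. The names `NonOverlappingSketch` and `GetNonOverlappingMatches` are hypothetical and not part of the answer. It assumes the three-argument `Pos` and the string helper `ToLower`, as used in the answer's own code.

```delphi
program NonOverlappingSketch; // hypothetical name, for illustration only

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Generics.Collections;

type
  TPhraseMatch = record
    Position: Integer;
    Text: string;
  end;

// Hypothetical variant of GetPhraseMatches: after each hit, the search
// resumes just behind the match, so matches never overlap.
function GetNonOverlappingMatches(const AText, APhrase: string): TArray<TPhraseMatch>;
var
  TextLower, PhraseLower: string;
  List: TList<TPhraseMatch>;
  Match: TPhraseMatch;
  p: Integer;
begin
  Result := nil;
  if APhrase = '' then
    Exit; // nothing to search for

  TextLower := AText.ToLower;
  PhraseLower := APhrase.ToLower;

  List := TList<TPhraseMatch>.Create;
  try
    p := Pos(PhraseLower, TextLower, 1);
    while p <> 0 do
    begin
      Match.Position := p - 1; // 0-based, as in the question
      Match.Text := Copy(AText, p, Length(APhrase));
      List.Add(Match);
      // Skip the whole match instead of advancing by a single character.
      p := Pos(PhraseLower, TextLower, p + Length(APhrase));
    end;
    Result := List.ToArray;
  finally
    List.Free;
  end;
end;

var
  Matches: TArray<TPhraseMatch>;
  i: Integer;

begin
  Matches := GetNonOverlappingMatches('wwwwww', 'ww');
  for i := 0 to High(Matches) do
    Writeln(Matches[i].Position, ': ', Matches[i].Text);
end.
```

With a two-character phrase such as `ww` the difference becomes visible: this variant reports positions 0, 2 and 4, whereas advancing by one character would also report the overlapping hits at 1 and 3. Which behaviour is wanted depends on the specification.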
1

Largely for my own amusement, I decided to write an answer which tokenizes the input in the same way Delphi's compiler does. This is shown below.

Of course, the OP's requirement that the code should match the 'WORD' in 'WORDS' precludes a direct comparison between the Target string and Parser.TokenString and necessitates the derivation of Fragment as written.

It shows, btw, that the use of constructs such as PWordRec and manual allocation and de-allocation of the 'tokens' are not necessary.

    program StringTokens;

    {$APPTYPE CONSOLE}

    {$R *.res}

    uses
      System.SysUtils, System.Classes;

    var
      Parser : TParser;
      MyPhrase : String;
      Target : String;
      Fragment : String;
      SS : TStringStream;
      List : TStringList;
      i : Integer;
    begin

      MyPhrase := 'I have a simple word and a complex Word: A lot of WORDS. A partial wor';
      Target := 'word';
      SS := TStringStream.Create(MyPhrase);
      List := TStringlist.Create;
      Parser := TParser.Create(SS);

      try
        while Parser.Token <> #0 do begin
          Fragment := Copy(Parser.TokenString, 1, Length(Target));
          if SameText(Fragment, Target) then
            List.Add(Fragment);
          Parser.NextToken;
        end;

        for i := 0 to List.Count - 1 do
          writeln(i, List[i]);
        readln;
      finally
        List.Free;
        Parser.Free;
        SS.Free;
      end;
    end.

Update:

In case it isn't obvious, it's trivial to obtain the positions in the source string where the token fragments occur, as follows:

    [...]
    if SameText(Fragment, Target) then
      List.AddObject(Fragment, TObject(Parser.SourcePos));

    [...]
    for i := 0 to List.Count - 1 do
      writeln(i, List[i], integer(List.Objects[i]));
MartynA
0

This gives the result as required in the question:


EDIT: I have now simplified the code by using WordRec.WordPos := MatchResult.Index;

EDIT2: Cleaned up the uses list

program MyTokenize;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.RegularExpressions,
  System.Classes,
  System.SysUtils;

type
  PWordRec = ^TWordRec;

  TWordRec = record
    WordStr: string;
    WordPos: Integer;
  end;

  TWordList = class(TList)
  private
    function Get(Index: Integer): PWordRec;
  public
    destructor Destroy; override;
    function Add(Value: PWordRec): Integer;
    property Items[Index: Integer]: PWordRec read Get; default;
  end;

function TWordList.Add(Value: PWordRec): Integer;
begin
  Result := inherited Add(Value);
end;

destructor TWordList.Destroy;
var
  i: Integer;
begin
  for i := 0 to Count - 1 do
  begin
    System.Dispose(Items[i]);
  end;
  inherited;
end;

function TWordList.Get(Index: Integer): PWordRec;
begin
  Result := PWordRec(inherited Get(Index));
end;

var
  WordList: TWordList;
  WordRec: PWordRec;
  i: Integer;
  RegexObj: TRegEx;
  MatchResult: TMatch;
  MyPhrase, MyWord: string;

begin
  try
    MyPhrase := 'A crossword contains words but not WORD';
    MyWord := 'word';

    WordList := TWordList.Create;
    try
      RegexObj := TRegEx.Create(MyWord, [roIgnoreCase]);
      MatchResult := RegexObj.Match(MyPhrase);
      while MatchResult.Success do
      begin
        WordRec := System.New(PWordRec);
        WordRec.WordPos := MatchResult.Index;
        WordRec.WordStr := MatchResult.Value;
        WordList.Add(WordRec);
        MatchResult := MatchResult.NextMatch;
      end;

      // Output:
      for i := 0 to WordList.Count - 1 do
        Writeln('WordPos: ', WordList[i].WordPos, ' WordStr: ', WordList[i].WordStr);

      WriteLn('  Press Enter to free the list');
      ReadLn;
    finally
      WordList.Free;
    end;

  except
    on E: Exception do
    begin
      Writeln(E.ClassName, ': ', E.Message);
      ReadLn;
    end;
  end;
end.
user1580348
  • Please note that you must use `Dispose`. If you use `FreeMem` you will leak the long string heap objects, as you can see if you enable memory leak reports (`ReportMemoryLeaksOnShutdown := True`). (And, of course, today it is a much better idea to use a `TList<TWordRec>`. Then you need *no* boilerplate code and get guaranteed type safety and memory safety, and you need no knowledge of generics as a concept.) – Andreas Rejbrand Feb 20 '21 at 13:09
  • Thanks, I had left the comment as a warning to myself. Now it is removed. – user1580348 Feb 20 '21 at 13:12
  • Thanks! And my comment about generics is mainly for others who come to this page from Google. – Andreas Rejbrand Feb 20 '21 at 13:13
  • @AndreasRejbrand If you create a solution from my solution using your proposed Generics solution, then I will accept it as the main solution. – user1580348 Feb 20 '21 at 17:00
  • I wrote an answer showing you how I would solve the task. Basically, it is your solution but with a generic list and a non-regex search method packaged into a function returning a dynamic array for convenience. – Andreas Rejbrand Feb 20 '21 at 18:03