0

Have some old code (written by someone else) that I need to fix to work with Unicode strings in Delphi 10.1. EDIT: I've narrowed my question down to the following: code below fails with unicode strings. Suggestions?

//global variable:  
var
  UpCaseLookup : array[ 1..255 ] of char; 


// ---- Knuth, Morris, Pratt:
type
    failure  = array[1..255] of word;

procedure PrepareUpcaseLookup;
var
  S : string; //was shortstring;
  i : integer;
begin
  for i := 1 to 255 do
  begin
    S := ToUpper( chr(i) ); //was AnsiUpperCase
    UpCaseLookup[i] := S[1]
  end
end;
Jarle
  • 45
  • 6
  • You don't appear to have made any attempt to translate this. Are you asking us to do it for you? Have you considered hiring a programmer? – David Heffernan Nov 01 '16 at 09:54
  • I have obviously tried to figure this out by myself, but instead of sharing my failed attempts I thought it was better to post the original code. I thought asking for help and advise was the purpose of this forum, but apparently I'm missing something. Sorry. – Jarle Nov 01 '16 at 10:00
  • Yes, I think you are missing something. It's not a forum, it's a strict question and answer site. Have you spent much time reading the articles at the [help]? – David Heffernan Nov 01 '16 at 10:00
  • 3
    Yes: "Be welcoming, be patient, and assume good intentions. Don't expect new users to know all the rules — they don't." Again, sorry for wasting your time. – Jarle Nov 01 '16 at 10:06
  • 1
    Irrespective of what I assume, I can't see a question here. The only thing I can imagine is that you want us to do the work for you. That's not appropriate as I'm sure you know. If you want finer grained help, you need to ask a more specific question. Please do so with an edit. – David Heffernan Nov 01 '16 at 10:14
  • Shortstrings are always Ansi, so change them to String, and change `ord(p[0])` to `Length(p)`. – MBo Nov 01 '16 at 11:03
  • We're glad to help. We're not a code porting or writing service, however. Make an effort to do the work yourself. If you run into difficulties, you can include the *relevant portions* of the code, explain the problem you've encountered, and ask a specific question related to that code. – Ken White Nov 01 '16 at 12:32
  • In what way does your code fail? Please be more specific. – LU RD Nov 01 '16 at 13:19
  • Access Violation. 255 makes sense for a shortstring, but not for a string (which may or may not be the solution I'm looking for). I guess I need a different approach, but this is where I'm stuck at the moment. – Jarle Nov 01 '16 at 13:29
  • Do you need case-insensitive search? – MBo Nov 01 '16 at 13:46
  • Yes, unfortunately I do. – Jarle Nov 01 '16 at 13:50
  • Make Uppercase for both strings and throw away this weird code. – MBo Nov 01 '16 at 13:52
  • Thanks. I'm sure the original developer had good reasons for doing it this way, some 17-18 years ago. The code was first written for Delphi 2 or 3, if I remember correctly. The code is still extremely fast and efficient (compiled with Delphi 7, running on Win 10), which is why I was hoping it could easily be rewritten to handle unicode. I may look into other solutions, like using SearchBuf. Anyone know how good it is, performance wise? – Jarle Nov 01 '16 at 14:36
  • So what doesn't work now? Code you wrote works fine for me. I added System.Character to 'uses' list, and that was it. UpCaseLookup was populated correctly. – Yuriy Afanasenkov Nov 01 '16 at 15:29
  • Regarding the edit that should work unless you have zero based strings. But why convert to Unicode if you restrict to 8 bit characters. And above 127 on UTF16 differs from 8 bit encodings. So you need to work harder at understanding all of this. – David Heffernan Nov 01 '16 at 21:28

1 Answers1

1
  function PosKnuthMorrisPratt(Pattern, Text: string): Integer;
  var
    Prefix: array of Integer;
    i, k: Integer;
  begin
    Result := 0;
    if (Pattern = '') or (Text = '') then
      Exit;

    Pattern := UpperCase(Pattern); // case-insensitive
    Text := UpperCase(Text);

    // Buld prefix function array
    SetLength(Prefix, Length(Pattern) + 1);
    Prefix[1] := 0;
    k := 0;
    for i := 2 to Length(Pattern) do begin
      while (k > 0) and (Pattern[k + 1] <> Pattern[i]) do
        k := Prefix[k];
      if Pattern[k + 1] = Pattern[i] then
        Inc(k);
      Prefix[i] := k;
    end;

    k := 0;
    for i := 1 to Length(Text) do begin
      while (k > 0) and (Pattern[k + 1] <> Text[i]) do
        k := Prefix[k];
      if Pattern[k + 1] = Text[i] then
        Inc(k);
      if k = Length(Pattern) then
        Exit(i + 1 - Length(Pattern));
    end;
  end;

begin
  Memo1.Lines.Add(IntToStr(PosKnuthMorrisPratt('abaBc', 'ggabagabAbccsab')));
  Memo1.Lines.Add(IntToStr(PosKnuthMorrisPratt('ab', 'ggagbc')));
MBo
  • 77,366
  • 5
  • 53
  • 86
  • Thank you! Did a quick test and works perfectly. Will be interesting to compare performance with the old code - scanning tens or even hundreds of thousands lines of text. – Jarle Nov 02 '16 at 11:08
  • Having run some tests, it seems like I will throw away my old code and start from scratch. Perhaps this should be posted as a new question, but here goes: What is the fastest way to open a big text file (easily 50 or 100 MB, UTF-8 encoded), scan the entire file, looking for a specific string (or even a single character) and copy all lines containing a match to a TStringlist (or a memo or whatever). For now, I've tested a TStreamReader and looped through the whole thing line by line. Way too slow. – Jarle Nov 04 '16 at 11:52
  • Have you checked Boyer-Moore algorithm? It is usually faster if data sets is not weird. Anyway, you'd better to write MCVE, show your real code – MBo Nov 04 '16 at 12:06
  • Check with fastcoders `PosEx` version to compare. See [Delphi: fast Pos with 64-bit](http://stackoverflow.com/a/20947429/576719). – LU RD Nov 04 '16 at 12:27
  • Thanks. I'm doing this: SR := TStreamReader.Create('D:\test.txt'); while not (SR.EndOfStream) do begin line := SR.ReadLine; if pos(editSearch.Text, line) > 0 then Memo1.Lines.Add(line); end; SR.Free; – Jarle Nov 04 '16 at 12:41
  • Sorry. Unable to format code in comments? Anyway: the above code takes about 10 seconds to search through a 187 MB file with some 2,3 million lines of text. Not bad, but my old KMP routine does the same in less than a second (ANSI only). Perhaps that's the expected penalty for adding Unicode support? – Jarle Nov 04 '16 at 12:49
  • One more thing: just scanning through the file, without searching for anything (commenting out "if pos.." and "memo1.lines.add.."), takes almost 7 seconds. Apparently, looping through line by line using TStreamReader is just too slow. – Jarle Nov 04 '16 at 12:58
  • Using linked PosEx code with Uppercase: filesize 153 MB lines treated 2272417 occurrences found 45120 ms elapsed 468. Without Uppercase: 250 ms. My KMP code: 1200 ms – MBo Nov 04 '16 at 13:06
  • Thanks, MBo. Now we're talking. Not clear how to implement this. How do you read the file? And do you still loop through the text line by line? A small example would be great! – Jarle Nov 04 '16 at 13:34
  • TStringList.Loadfromfile. I did not account for loading time – MBo Nov 04 '16 at 15:03
  • Ah, ok. Thanks anyway. Like I mentioned, loading time seems like my biggest concern right now. Just opening and traversing this 187 MB test file, without searching, using TStreamReader takes around 7 secs. – Jarle Nov 04 '16 at 15:55
  • This time seems reasonable, but depends on HDD speed. Have you checked loading in TStringList to exclude possible TStreamReader slowness? – MBo Nov 04 '16 at 16:14