2

I have these three functions that successfully remove all non-numeric characters from a given string:

The first function loops through the characters of the input string, and if the current character is a number, it adds it to a new string that is returned as the result of the function.

  function RemoveNonNumericChars(const s: string): string;
  begin
    Result := '';
    for var i := 1 to Length(s) do
    begin
      if s[i] in ['0'..'9'] then
        Result := Result + s[i];
    end;
  end;

The second function loops through the characters of the input string from right to left, and if the current character is not a number, it uses the Delete function to remove it from the string

  function RemoveNonNumericChars(const s: string): string;
  begin
    Result := s;
    for var i := Length(Result) downto 1 do
    begin
      if not(Result[i] in ['0'..'9']) then
        Delete(Result, i, 1);
    end;
  end;

The third function uses a regular expression to replace all non-numeric characters with nothing, thus removing them. TRegEx is from the System.RegularExpressions unit.

  function RemoveNonNumericChars(const s: string): string;
  begin
    var RegEx := TRegEx.Create('[^0-9]');
    Result := RegEx.Replace(s, '');
  end;

All three of them do what I need, but I want to know if there is maybe a built-in function in Delphi for this... Or maybe even a better way to do it than the way I'm doing it. What's the best and/or fastest way to remove non-numeric characters from a string in Delphi?

Shaun Roselt
  • 1,650
  • 5
  • 18
  • 44
  • 1
    Regarding "best and/or fastest", that very much depends on the precise situation. And there's a huge difference between "best" and "fastest". You may be surprised to learn that even the slowest approach on this page can handle an extremely large number of inputs like "1561g894861a561b861b" per second! The fastest approach can handle many times that, but will be much less readable and maintainable. If you need to process GUI input, maybe it isn't worth making the code readable only to a CS PhD just to reduce the runtime from 0.2 ms to 0.01 ms (made-up numbers)? – Andreas Rejbrand Jan 18 '23 at 13:38
  • 1
    As per context "_numeric_" characters may also include the minus (negative number) and a decimal separator (real number). And as per exponential notation also `e`. – AmigoJack Jan 18 '23 at 13:52

1 Answers1

8

Both your approaches are slow because you constantly change the length of the string. Also, they only recognise Arabic digits.

To solve the performance issue, preallocate the maximum result length:

function RemoveNonDigits(const S: string): string;
begin
  SetLength(Result, S.Length);
  var LActualLength := 0;
  for var i := 1 to S.Length do
    if CharInSet(S[i],  ['0'..'9']) then
    begin
      Inc(LActualLength);
      Result[LActualLength] := S[i];
    end;
  SetLength(Result, LActualLength);
end;

To support non-Arabic digits, use the TCharacter.IsDigit function:

function RemoveNonDigits(const S: string): string;
begin
  SetLength(Result, S.Length);
  var LActualLength := 0;
  for var i := 1 to S.Length do
    if S[i].IsDigit then
    begin
      Inc(LActualLength);
      Result[LActualLength] := S[i];
    end;
  SetLength(Result, LActualLength);
end;

To optimise even further, as suggested by Stefan Glienke, you can bypass the RTL's string handling machinery and write each character directly with some loss of code readability:

function RemoveNonDigits(const S: string): string;
begin
  SetLength(Result, S.Length);
  var ResChr := PChar(Result);
  var LActualLength := 0;
  for var i := 1 to S.Length do
    if CharInSet(S[i],  ['0'..'9']) then
    begin
      Inc(LActualLength);
      ResChr^ := S[i];
      Inc(ResChr);
    end;
  SetLength(Result, LActualLength);
end;

Benchmark

Just for fun I did a very primitive benchmark on random input strings with length < 100 and about 24% chance of a char being a digit:

program Benchmark;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.SysUtils, System.RegularExpressions, Windows;

function OP1(const s: string): string;
begin
  Result := '';
  for var i := 1 to Length(s) do
  begin
    if s[i] in ['0'..'9'] then
      Result := Result + s[i];
  end;
end;

function OP2(const s: string): string;
begin
  Result := s;
  for var i := Length(Result) downto 1 do
  begin
    if not(Result[i] in ['0'..'9']) then
      Delete(Result, i, 1);
  end;
end;

function OP3(const s: string): string;
begin
  var RegEx := TRegEx.Create('[^0-9]');
  Result := RegEx.Replace(s, '');
end;

function AR1(const S: string): string;
begin
  SetLength(Result, S.Length);
  var LActualLength := 0;
  for var i := 1 to S.Length do
    if CharInSet(S[i],  ['0'..'9']) then
    begin
      Inc(LActualLength);
      Result[LActualLength] := S[i];
    end;
  SetLength(Result, LActualLength);
end;

function AR2(const S: string): string;
begin
  SetLength(Result, S.Length);
  var ResChr := PChar(Result);
  var LActualLength := 0;
  for var i := 1 to S.Length do
    if CharInSet(S[i],  ['0'..'9']) then
    begin
      Inc(LActualLength);
      ResChr^ := S[i];
      Inc(ResChr);
    end;
  SetLength(Result, LActualLength);
end;

function AR3(const S: string): string;
begin
  SetLength(Result, S.Length);
  var ResChr := PChar(Result);
  for var i := 1 to S.Length do
    if CharInSet(S[i],  ['0'..'9']) then
    begin
      ResChr^ := S[i];
      Inc(ResChr);
    end;
  SetLength(Result, ResChr - PChar(Result));
end;

function RandomInputString: string;
begin
  SetLength(Result, Random(100));
  for var i := 1 to Result.Length do
    Result[i] := Chr(Ord('0') + Random(42));
end;

begin

  Randomize;

  const N = 1000000;

  var Data := TArray<string>(nil);
  SetLength(Data, N);
  for var i := 0 to N - 1 do
    Data[i] := RandomInputString;

  var f, c0, cOP1, cOP2, cOP3, cAR1, cAR2, cAR3: Int64;

  QueryPerformanceFrequency(f);

  QueryPerformanceCounter(c0);
  for var i := 0 to High(Data) do
    OP1(Data[i]);
  QueryPerformanceCounter(cOP1);
  Dec(cOP1, c0);

  QueryPerformanceCounter(c0);
  for var i := 0 to High(Data) do
    OP2(Data[i]);
  QueryPerformanceCounter(cOP2);
  Dec(cOP2, c0);

  QueryPerformanceCounter(c0);
  for var i := 0 to High(Data) do
    OP3(Data[i]);
  QueryPerformanceCounter(cOP3);
  Dec(cOP3, c0);

  QueryPerformanceCounter(c0);
  for var i := 0 to High(Data) do
    AR1(Data[i]);
  QueryPerformanceCounter(cAR1);
  Dec(cAR1, c0);

  QueryPerformanceCounter(c0);
  for var i := 0 to High(Data) do
    AR2(Data[i]);
  QueryPerformanceCounter(cAR2);
  Dec(cAR2, c0);

  QueryPerformanceCounter(c0);
  for var i := 0 to High(Data) do
    AR3(Data[i]);
  QueryPerformanceCounter(cAR3);
  Dec(cAR3, c0);

  Writeln('Computations per second:');
  Writeln('OP1: ', Round(N / (cOP1 / f)));
  Writeln('OP2: ', Round(N / (cOP2 / f)));
  Writeln('OP3: ', Round(N / (cOP3 / f)));
  Writeln('AR1: ', Round(N / (cAR1 / f)));
  Writeln('AR2: ', Round(N / (cAR2 / f)));
  Writeln('AR3: ', Round(N / (cAR3 / f)));

  Readln;

end.

Result:

Computations per second:
OP1: 1398134
OP2: 875116
OP3: 39162
AR1: 3406172
AR2: 4063260
AR3: 4032343

Bar chart with the results.

As you can see, in this test at least, regular expressions are by far the slowest approach. And preallocating makes a major difference, while avoiding the _UniqueStringU issue appears to make only a relatively minor improvement.

But even with the very slow RegEx approach, you can do 40 000 calls per second. On my 13-year-old computer.

Andreas Rejbrand
  • 105,602
  • 8
  • 282
  • 384
  • 1
    Your code includes a common mistake (wrt performance) when dealing with strings - every write to a location always includes a call to `System._UniqueStringU`. Since you know that `Result` is already unique given you called `SetLength` on it this can be avoided by writing to it via `PChar`. That avoids unnecessary register preservation because then there is no call anymore within the loop body. – Stefan Glienke Jan 18 '23 at 12:06
  • @StefanGlienke: That is correct. You can make it even faster by bypassing that machinery. – Andreas Rejbrand Jan 18 '23 at 12:08
  • @StefanGlienke: Is there a more readable version than the one I added now? – Andreas Rejbrand Jan 18 '23 at 12:15
  • 1
    Depends on your definition of readable - I can read pointer magic well, others don't. :) – Stefan Glienke Jan 18 '23 at 12:18
  • If I had to nitpick I would say that a) reading `S[I]` twice (yes, a good compiler could notice and only read once but well...) could be avoided by a local variable and b) that the `LActualLength` variable is unnecessary because the required Length can be calculated via `ResChr - Pointer(Result)` – Stefan Glienke Jan 18 '23 at 12:28
  • I added a third example function using regex to my question. Any interesting facts I should know about using that method compared to the other ones listed? – Shaun Roselt Jan 18 '23 at 12:38
  • @ShaunRoselt: Probably even slower. – Andreas Rejbrand Jan 18 '23 at 13:18
  • @StefanGlienke: I did consider those things, but since they would further reduce the readability of the code I chose not to do it like that. In the vast majority of practical applications, even the slowest approach on this Q&A page is likely fast enough (with a huge margin too!). – Andreas Rejbrand Jan 18 '23 at 13:20
  • You can make it even faster by unrolling the loop. – Pieter B Jan 19 '23 at 09:43