5

I use IN a lot in my project and I have lots of these warnings:

[DCC Warning] Unit1.pas(40): W1050 WideChar reduced to byte char in set expressions. Consider using CharInSet function in SysUtils unit.

I made a quick test and using CharInSet instead of IN is from 65%-100% slower:

if s1[i] in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'] then

vs

if CharInSet(s1[i], ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']) then

Here is code for 2 tests, one works with loop through shorter strings, one loops once through a large string:

Adding 2 buttons on form I tested this for short string:

procedure TForm1.Button1Click(Sender: TObject);
var s1: string;
  t1, t2: TStopWatch;
  a, i, cnt, vMaxLoop: Integer;
begin
  s1 := '[DCC Warning] Unit1.pas(40): W1050 WideChar reduced to byte char in set expressions.  Consider using CharInSet function in SysUtils unit.';
  vMaxLoop := 10000000;

  cnt := 0;
  t1 := TStopWatch.Create;
  t1.Start;
  for a := 1 to vMaxLoop do
    for i := 1 to Length(s1) do
      if s1[i] in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'] then
        inc(cnt);
  t1.Stop;

  cnt := 0;
  t2 := TStopWatch.Create;
  t2.Start;
  for a := 1 to vMaxLoop do
    for i := 1 to Length(s1) do
      if CharInSet(s1[i], ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']) then
        inc(cnt);
  t2.Stop;

  Button1.Caption := inttostr(t1.ElapsedMilliseconds) + ' - ' + inttostr(t2.ElapsedMilliseconds);
end;

And this for 1 long string:

procedure TForm1.Button2Click(Sender: TObject);
var s1: string;
  t1, t2: TStopWatch;
  a, i, cnt, vMaxLoop: Integer;
begin

  s1 := '[DCC Warning] Unit1.pas(40): W1050 WideChar reduced to byte char in set expressions.  Consider using CharInSet function in SysUtils unit.';
  s1 := DupeString(s1, 1000000);
  s1 := s1 + s1 + s1 + s1; // DupeString is limited, use this to create longer string

  cnt := 0;
  t1 := TStopWatch.Create;
  t1.Start;
  for i := 1 to Length(s1) do
    if s1[i] in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'] then
      inc(cnt);
  t1.Stop;

  cnt := 0;
  t2 := TStopWatch.Create;
  t2.Start;
  for i := 1 to Length(s1) do
    if CharInSet(s1[i], ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']) then
      inc(cnt);
  t2.Stop;

  Button2.Caption := inttostr(t1.ElapsedMilliseconds) + ' - ' + inttostr(t2.ElapsedMilliseconds);
end;

Why do they recommend slower option, or how can I fix this warning without penalty in performance?

Mike Torrettinni
  • 1,816
  • 2
  • 17
  • 47
  • 2
    Take a look at http://stackoverflow.com/questions/332948/why-is-charinset-faster-than-case-statement – RBA Mar 11 '16 at 14:09
  • I saw that question, but dismissed it due to title - can't rad everything. But, it does include some valuable info. – Mike Torrettinni Mar 11 '16 at 14:33
  • 1
    Can somebody explain whether this warning is relevant or not in this case? The set contains only ascii chars and if `s1[i]` is a wide char, the delphi win32 compiler generates correct code for that comparison. I tried `s1 := 'Ł'; if (s1[1] in ['A']) then ...`, because `Byte('Ł') = 65 = Byte('A')`. But for this comparison the compiler generates correct code. – ventiseis Mar 11 '16 at 14:51
  • A comment and a question. Your set literal can be written `['A'..'Z']`. Does this literal appear in your code in more than one place? – David Heffernan Mar 11 '16 at 14:57
  • Quickly counted it appears around 20 times. Didn't realize until you asked... I assume you recommend using your `IsUpperCaseEnglishLetter` if applicable, right? – Mike Torrettinni Mar 11 '16 at 15:01
  • 2
    It's really important to follow the DRY principle. **D**on't **R**epeat **Y**ourself. – David Heffernan Mar 11 '16 at 15:05
  • 1
    @DavidHeffernan `ord('Ł') = 321`. I looked into the cpu window and the noticed that the full character code was loaded into `EAX`, nothing was shortened. – ventiseis Mar 11 '16 at 15:19
  • 2
    The CharInSet function suffers from the fact that the _set_ must be stored in memory because it is specified as a function argument. Thus the compiler can't generate fast arithmetic instructions that operate on CPU registers. It has to use the much slower memory bit-test instruction. So there are multiple memory accesses per iteration (non-inlined: function call, bit-test, function return; inlined: 2x stack juggling, bit-test) compared to none. – Andreas Hausladen Mar 11 '16 at 15:21
  • @ventiseis Yeah, I remembered wrongly. It's the other side that matters – David Heffernan Mar 11 '16 at 15:26

1 Answers1

10

The warning is telling you that your code may be defective. Because sets can only be based on types with ordinality of 256 or less, the base type is truncated to that size. Now, Char is an alias for WideChar and has ordinality 65536. So the warning is there to tell you that your program may not behave as you expect. For instance, one might ask what this expression evaluates to:

['A', chr(256)] = ['A']

One might expect it to evaluate false, but in fact it evaluates true. So I think you should certainly take heed of the compiler when it issues this warning.

Now, it so happens that your set, which can and should be written more concisely as ['A'..'Z'], is made up entirely of ASCII characters. And it happens (thanks to commentors Andreas and ventiseis) that in that case the compiler generates correct code for such a set, regardless of the ordinal value of the character to the left of the in operator. So

if s1[i] in ['A'..'Z'] then

will result in correct code, in spite of the warning. And the compiler is able to detect that the set's elements are contiguous and generate efficient code.

Note that this does depend on the set being a literal and so the optimisation can be performed by the compiler. And that is why it can perform so much better than CharInSet. Because CharInSet is a function, and the Delphi optimiser has limited power, CharInSet is not able to take advantage of the contiguous nature of this specific set literal.

The warning is annoying though, and do you really want to rely on remembering the very specific details of when this warning can safely be ignored. Another way to implement the test, and sidestep this warning is to use inequality operators:

if (c >= 'A') and (c <= 'Z') then
  ....

You'd probably wrap this in an inlined function to make the code even easier to read.

function IsUpperCaseEnglishLetter(c: Char): Boolean; inline;
begin
  Result := (c >= 'A') and (c <= 'Z');
end;

You should also ask yourself whether or not this code is a performance bottleneck. You should time your real program rather than such an artificial program. I'll bet that this code isn't a bottleneck and if so you should not treat performance as the key driver.

David Heffernan
  • 601,492
  • 42
  • 1,072
  • 1,490
  • Correct, is not a bottleneck, but if it doubles the execution time, it could become. FYI, I tested and the difference in performance with your inequality operators is minimal, 2384 -> 2355, 975 -> 959 ms. – Mike Torrettinni Mar 11 '16 at 14:26
  • 1
    Even if it doubles execution time it's unlikely to be a bottleneck. Your program does more than this I am sure. – David Heffernan Mar 11 '16 at 14:34
  • 3
    "Sets are far less efficient": Not in this case. The compiler is smart enough to change the long element list to ['A'..'Z'] itself and then it uses the fast `if (c >= 'A') and (c <= 'Z')` to implement the in-operator. And that also with correct code for WideChar as long as the set elements are Ord(x)<=#127. – Andreas Hausladen Mar 11 '16 at 15:10
  • @AndreasHausladen Yeah, you are right, the compiler does manage to optimise this one perfectly well. Even the 64 bit compiler. – David Heffernan Mar 11 '16 at 15:14
  • Very valuable additional details, David! – Mike Torrettinni Mar 11 '16 at 19:07
  • Use AnsiStrings instead of Strings. Unicode strings are inefficient. – Andy k Mar 14 '16 at 12:19