I have found this WideStringToString() function to convert a Unicode string to an ANSI string. I need to convert a string like àèéìòù to aeeiou, so all accents should be removed. I think it could be done with that function, but which codepage should I use?

Remy Lebeau
Walter Schrabmair
  • You could perhaps convert from TEncoding.Unicode to TEncoding.ASCII. The latter will most definitely not contain any accents. See the help for [TEncoding](http://docwiki.embarcadero.com/Libraries/Rio/en/System.SysUtils.TEncoding). – Rudy Velthuis Feb 23 '19 at 15:32
  • @RudyVelthuis except that the accents will likely get converted to `?` instead of their ASCII counterparts. `TEncoding` is not good at performing **transliteration** – Remy Lebeau Feb 23 '19 at 18:17
  • @Remy: It seems to work for the accents in my example code below. But obviously not for foreign characters like epsilon. It is not Google Translate, of course. – Rudy Velthuis Feb 23 '19 at 18:18
  • Maybe this helps: https://stackoverflow.com/questions/1891196/convert-hi-ansi-chars-to-ascii-equivalent-%c3%a9-e – Uli Gerhardt Feb 23 '19 at 18:52
  • @UliGerhardt: note that the accepted answer uses WideCharToMultiByte, which is used by TEncoding too (on Windows). – Rudy Velthuis Feb 23 '19 at 21:22

1 Answer

The current way to do this is to use System.SysUtils.TEncoding. An example:

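```delphi
uses
  System.SysUtils; // TEncoding, TBytes

// Round-trips the string through 7-bit ASCII. On Windows, TEncoding relies on
// WideCharToMultiByte (see the comments), which maps many accented letters to
// their base letter. Anything without such a mapping comes back as '?'.
function RemoveAccents(const Src: string): string;
var
  Bytes: TBytes;
begin
  Bytes := TEncoding.ASCII.GetBytes(Src);      // UTF-16 -> ASCII bytes
  Result := TEncoding.ASCII.GetString(Bytes);  // ASCII bytes -> string
end;

procedure Test;
begin
  Writeln(RemoveAccents('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ'));
  Writeln(RemoveAccents('àèéìòù'));
end;
```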

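This could not convert the epsilon (ε), presumably because it is a separate Greek letter rather than an accented Latin one and so has no ASCII counterpart to fall back to. The output is: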

Th? quick brown fox jump?d over the lazy dog
aeeiou
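
For comparison, the `NormalizeString` route mentioned in the comments can be sketched roughly as follows. This is a minimal, Windows-only sketch, not part of the tested code above: it assumes your Delphi version declares `NormalizeString` and `NormalizationD` in `Winapi.Windows` and provides `GetUnicodeCategory` in `System.Character`. The name `StripCombiningMarks` is hypothetical. The idea is to decompose each accented letter into a base letter plus combining marks and then drop the marks.

```delphi
program StripMarksDemo;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows,    // NormalizeString, NormalizationD (assumed to be declared here)
  System.SysUtils,
  System.Character;  // Char.GetUnicodeCategory, TUnicodeCategory

// Hypothetical helper: decompose to NFD, then remove non-spacing marks.
function StripCombiningMarks(const Src: string): string;
var
  Decomposed: string;
  Len, I: Integer;
begin
  Result := '';
  if Src = '' then
    Exit;

  // A call with a nil destination returns an estimated buffer size.
  Len := NormalizeString(NormalizationD, PChar(Src), Length(Src), nil, 0);
  if Len <= 0 then
    Exit(Src); // normalization failed, return the input unchanged

  SetLength(Decomposed, Len);
  // The second call returns the number of characters actually written.
  Len := NormalizeString(NormalizationD, PChar(Src), Length(Src),
    PChar(Decomposed), Len);
  if Len <= 0 then
    Exit(Src);
  SetLength(Decomposed, Len);

  // Keep everything except combining (non-spacing) marks.
  for I := 1 to Length(Decomposed) do
    if Decomposed[I].GetUnicodeCategory <> TUnicodeCategory.ucNonSpacingMark then
      Result := Result + Decomposed[I];
end;

begin
  Writeln(StripCombiningMarks('àèéìòù'));
end.
```

Unlike the `TEncoding.ASCII` round trip, this leaves characters without a canonical decomposition untouched instead of replacing them with `?`, so letters such as ł, ħ and ε would pass through unchanged and may still need separate handling.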
Rudy Velthuis
  • I tested with NormalizeString and it does not normalize ε either. I looked [here](https://www.unicode.org/charts/beta/normalization/chart_Greek.html) to see if it is expected but I didn't understand anything from that chart. – Sertac Akyuz Feb 23 '19 at 19:49
  • @Sertac: I think it says that epsilon is never composed, i.e. always a single value codepoint. But well, several of these look like an epsilon, so it is pretty confusing. – Rudy Velthuis Feb 23 '19 at 19:56
  • Thanks a lot for your advice! Epsilon will not occur in my data, so this is a suitable solution! – Walter Schrabmair Feb 24 '19 at 08:01