What is a text element in the context of System.Globalization.StringInfo in C#?

I came across the concept of text elements while learning C# from the book CLR via C#. The book seems to assume that the reader already understands what a text element is, but I do not get the concept at all.

Also, the documentation is not very verbose on the topic.

I would like to know the definition of a text element.

My guess is that it is just a Unicode character, which is not necessarily represented by a single System.Char (in some cases it takes two System.Char values: a high and a low surrogate). But I am not sure that my guess is correct.

My other guess is that it is a whole word.

Text elements are mentioned in this piece of code from the CLR via C# book:

    using System;
    using System.Text;
    using System.Globalization;
    using System.Windows.Forms;

    public sealed class Program {
        public static void Main() {
            // The string below contains combining characters
            String s = "a\u0304\u0308bc\u0327";
            SubstringByTextElements(s);
            EnumTextElements(s);
            EnumTextElementIndexes(s);
        }

        private static void SubstringByTextElements(String s) {
            String output = String.Empty;
            StringInfo si = new StringInfo(s);
            for (Int32 element = 0; element < si.LengthInTextElements; element++) {
                output += String.Format(
                    "Text element {0} is '{1}'{2}",
                    element, si.SubstringByTextElements(element, 1),
                    Environment.NewLine);
            }
            MessageBox.Show(output, "Result of SubstringByTextElements");
        }

        private static void EnumTextElements(String s) {
            String output = String.Empty;
            TextElementEnumerator charEnum =
                StringInfo.GetTextElementEnumerator(s);
            while (charEnum.MoveNext()) {
                output += String.Format(
                    "Character at index {0} is '{1}'{2}",
                    charEnum.ElementIndex, charEnum.GetTextElement(),
                    Environment.NewLine);
            }
            MessageBox.Show(output, "Result of GetTextElementEnumerator");
        }

        private static void EnumTextElementIndexes(String s) {
            String output = String.Empty;
            Int32[] textElemIndex = StringInfo.ParseCombiningCharacters(s);
            for (Int32 i = 0; i < textElemIndex.Length; i++) {
                output += String.Format(
                    "Character {0} starts at index {1}{2}",
                    i, textElemIndex[i], Environment.NewLine);
            }
            MessageBox.Show(output, "Result of ParseCombiningCharacters");
        }
    }
qqqqqqq
  • Under **Remarks**, it says, *".NET defines a text element as a unit of text that is displayed as a single character, that is, a grapheme. A text element can be a base character, a surrogate pair, or a combining character sequence."* – madreflection Jan 13 '20 at 18:54
  • Related reading: http://www.joelonsoftware.com/articles/Unicode.html – GSerg Jan 13 '20 at 18:55
  • @madreflection, thank you for the note about **System.Globalization.StringInfo**. Could you please add your last comment as an answer? Also, would you be so kind as to provide the definition of a **combining character sequence** as well? – qqqqqqq Jan 13 '20 at 18:56
  • Incidentally, the remarks also confirm your guess about surrogate pairs. – madreflection Jan 13 '20 at 19:10
  • @madreflection, do you happen to know what the difference is between a character and a text element? The chapter in the **CLR via C#** book is called **Examining a String’s Characters and Text Elements**, which implies that a character and a text element are not the same thing. – qqqqqqq Jan 13 '20 at 19:15
  • @madreflection, from the definition I see that text elements are displayed as characters. Does that mean a text element could also be described as the internal representation of a character? – qqqqqqq Jan 13 '20 at 19:17

1 Answer

Under Remarks, the documentation you linked says:

.NET defines a text element as a unit of text that is displayed as a single character, that is, a grapheme. A text element can be a base character, a surrogate pair, or a combining character sequence.
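The three cases named in that definition can be checked with a small console sketch. The class name `TextElementKinds` and the helper `CountTextElements` are made up for illustration; only `StringInfo` and its `LengthInTextElements` property come from the framework. It compares the number of `System.Char` values with the number of text elements for one sample of each kind.

```csharp
using System;
using System.Globalization;

public static class TextElementKinds
{
    // Hypothetical helper: number of text elements StringInfo finds in s.
    public static int CountTextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    public static void Main()
    {
        string baseChar = "a";                 // a single base character
        string surrogatePair = "\U0001F600";   // one code point stored as two chars
        string combining = "a\u0301";          // base letter + combining acute accent

        foreach (string s in new[] { baseChar, surrogatePair, combining })
        {
            // s.Length counts System.Char values, not text elements.
            Console.WriteLine("chars: {0}, text elements: {1}",
                s.Length, CountTextElements(s));
        }
    }
}
```

Each sample should report exactly one text element, even though the last two occupy two `System.Char` values each. For longer sequences (for example, multi-part emoji) the grouping may differ between .NET versions, so treat this as a sketch and not a specification.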

I'm no expert, but an example of a combining character sequence would be a letter followed by a combining diacritic: a followed by U+0301 (combining acute accent) is displayed as á.
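To see how this applies to the string from the question, here is a minimal console sketch that splits it into text elements and lists the `System.Char` values inside each one. `TextElementSplitter` and `Split` are hypothetical names; the framework calls used are `StringInfo.GetTextElementEnumerator` and `TextElementEnumerator.GetTextElement`.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementSplitter
{
    // Hypothetical helper: returns each text element of s as its own string.
    public static List<string> Split(string s)
    {
        var elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            elements.Add(e.GetTextElement());
        }
        return elements;
    }

    public static void Main()
    {
        // Same string as in the question: 6 chars in total.
        string s = "a\u0304\u0308bc\u0327";

        foreach (string element in Split(s))
        {
            // List the UTF-16 code units that make up this text element.
            var codeUnits = new List<string>();
            foreach (char c in element)
            {
                codeUnits.Add("U+" + ((int)c).ToString("X4"));
            }
            Console.WriteLine("{0} char(s): {1}",
                element.Length, string.Join(" ", codeUnits));
        }
    }
}
```

The six `System.Char` values group into three text elements: `a` with its two combining marks, `b` on its own, and `c` with the combining cedilla. That grouping is the difference between a "character" in the `System.Char` sense and a text element.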

madreflection
  • [This](https://stackoverflow.com/a/1732454/477420) is a good example of those "combining characters" (in addition to being a good educational resource :) ) – Alexei Levenkov Jan 13 '20 at 19:30