C# strings are always Unicode (UTF-16), so if you can load the text without issue it is already Unicode. If you aren't getting the text you expect, look at which encoding is used when the text is read.
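As a rough illustration of what "looking into encodings" means, the sketch below decodes the same UTF-8 bytes twice, once with the right encoding and once with the wrong one. The class name EncodingDemo and the helper Decode are made up for this example, and the byte array is only a stand-in for whatever your file or stream contains.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Hypothetical helper: decodes raw bytes using the encoding with the given name.
    public static string Decode(byte[] bytes, string encodingName) =>
        Encoding.GetEncoding(encodingName).GetString(bytes);

    static void Main()
    {
        // Assumption: the console can display UTF-8 output.
        Console.OutputEncoding = Encoding.UTF8;

        // Stand-in for bytes read from a UTF-8 encoded file.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("x²");

        Console.WriteLine(Decode(utf8Bytes, "utf-8"));      // x²
        Console.WriteLine(Decode(utf8Bytes, "iso-8859-1")); // xÂ² (wrong encoding)
    }
}
```

If your superscripts come out looking like the second line, the problem is the encoding passed to the reader, not the string handling afterwards.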
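Based on "Unicode subscripts and superscripts", the superscript digits aren't in one contiguous block (¹, ² and ³ sit at U+00B9, U+00B2 and U+00B3, while ⁰ and ⁴ to ⁹ are in U+2070 to U+2079), which makes them awkward to detect with a simple range check. The easiest way to see if you have a superscript is therefore to use a switch statement.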
    static bool IsSuperscript(char c)
    {
        switch (c)
        {
            case '⁰':
            case '¹':
            case '²':
            case '³':
            case '⁴':
            case '⁵':
            case '⁶':
            case '⁷':
            case '⁸':
            case '⁹':
                return true;
            default:
                return false;
        }
    }
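To see why a range check on the character value is not enough, you can print the code point of each superscript digit. This is only a small sketch; the class name CodePointDemo and the helper Describe are made up for illustration.

```csharp
using System;

static class CodePointDemo
{
    // Hypothetical helper: formats a char's UTF-16 code unit as U+XXXX.
    public static string Describe(char c) => $"U+{(int)c:X4}";

    static void Main()
    {
        // Prints U+2070, U+00B9, U+00B2, U+00B3, then U+2074 through U+2079.
        foreach (char c in "⁰¹²³⁴⁵⁶⁷⁸⁹")
        {
            Console.WriteLine(Describe(c));
        }
    }
}
```

The three values below U+0100 are the reason a single "between first and last" comparison would miss ¹, ² and ³.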
Then to see if a string contains only superscript characters you just need to loop through it.
    static bool IsSuperscript(string s)
    {
        foreach (var c in s)
        {
            if (!IsSuperscript(c))
            {
                return false;
            }
        }
        return true;
    }
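One thing to be aware of: the loop above returns true for an empty string, because there is no character that fails the check. If that is not what you want, a variant along these lines may help. It is a sketch only; the names SuperscriptCheck and IsNonEmptySuperscript are made up here, and the character test uses a lookup string instead of the switch just to keep the example short.

```csharp
using System;
using System.Linq;

static class SuperscriptCheck
{
    const string SuperscriptDigits = "⁰¹²³⁴⁵⁶⁷⁸⁹";

    // Compact stand-in for the switch-based character check.
    public static bool IsSuperscript(char c) => SuperscriptDigits.IndexOf(c) >= 0;

    // Hypothetical variant: null and empty strings do not count as superscript.
    public static bool IsNonEmptySuperscript(string s) =>
        !string.IsNullOrEmpty(s) && s.All(IsSuperscript);

    static void Main()
    {
        Console.WriteLine(IsNonEmptySuperscript("¹⁶")); // True
        Console.WriteLine(IsNonEmptySuperscript("1⁶")); // False
        Console.WriteLine(IsNonEmptySuperscript(""));   // False
    }
}
```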
If you want to convert a superscript character into a normal number character you can use a similar switch statement.
    static bool TryNormalizeSuperscript(char superC, out char c)
    {
        bool result = true;
        switch (superC)
        {
            case '⁰':
                c = '0';
                break;
            case '¹':
                c = '1';
                break;
            case '²':
                c = '2';
                break;
            case '³':
                c = '3';
                break;
            case '⁴':
                c = '4';
                break;
            case '⁵':
                c = '5';
                break;
            case '⁶':
                c = '6';
                break;
            case '⁷':
                c = '7';
                break;
            case '⁸':
                c = '8';
                break;
            case '⁹':
                c = '9';
                break;
            default:
                c = '\0';
                result = false;
                break;
        }
        return result;
    }
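Then loop over the string, converting one character at a time:
    // Requires "using System.Text;" for StringBuilder.
    static string NormalizeSuperscript(string s)
    {
        var sb = new StringBuilder();
        foreach (var superC in s)
        {
            if (TryNormalizeSuperscript(superC, out char c))
            {
                sb.Append(c);
            }
            else
            {
                // Stop at the first character that is not a superscript digit.
                break;
            }
        }
        return sb.ToString();
    }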
Note that this loop stops at the first non-superscript character it finds. Depending on your use case that may need to change.
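If stopping early does not suit your use case, one possible change is to copy non-superscript characters through unchanged. The sketch below assumes that behaviour is what you want; the names SuperscriptNormalizer and NormalizeSuperscriptKeepOthers are made up for this example, and the lookup string replaces the switch only to keep it self-contained.

```csharp
using System;
using System.Text;

static class SuperscriptNormalizer
{
    // The index in this string equals the digit value: '⁰' is at 0, '¹' at 1, and so on.
    const string SuperscriptDigits = "⁰¹²³⁴⁵⁶⁷⁸⁹";

    // Hypothetical variant: converts superscript digits and keeps every
    // other character as-is instead of stopping at the first one.
    public static string NormalizeSuperscriptKeepOthers(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char ch in s)
        {
            int digit = SuperscriptDigits.IndexOf(ch);
            sb.Append(digit >= 0 ? (char)('0' + digit) : ch);
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(NormalizeSuperscriptKeepOthers("E=mc²")); // E=mc2
        Console.WriteLine(NormalizeSuperscriptKeepOthers("¹⁶ XX")); // 16 XX
    }
}
```

Another option would be to skip the unknown characters entirely, by not appending anything in the non-superscript case.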
Example usage:
    static void Main(string[] args)
    {
        Console.OutputEncoding = System.Text.Encoding.Unicode;
        var superscripts = "⁰ ¹ ² ³ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ¹⁰ ¹¹ ¹² ¹³ ¹⁴ ¹⁵ ¹⁶ 17 18 19 XX XXI XXII XXIII XXIV";
        foreach (var superscript in superscripts.Split(' '))
        {
            Console.WriteLine($"{superscript} ({IsSuperscript(superscript)}) -> {NormalizeSuperscript(superscript)}");
        }
    }
Outputs:
    ⁰ (True) -> 0
    ¹ (True) -> 1
    ² (True) -> 2
    ³ (True) -> 3
    ⁴ (True) -> 4
    ⁵ (True) -> 5
    ⁶ (True) -> 6
    ⁷ (True) -> 7
    ⁸ (True) -> 8
    ⁹ (True) -> 9
    ¹⁰ (True) -> 10
    ¹¹ (True) -> 11
    ¹² (True) -> 12
    ¹³ (True) -> 13
    ¹⁴ (True) -> 14
    ¹⁵ (True) -> 15
    ¹⁶ (True) -> 16
    17 (False) ->
    18 (False) ->
    19 (False) ->
    XX (False) ->
    XXI (False) ->
    XXII (False) ->
    XXIII (False) ->
    XXIV (False) ->
Note that the line Console.OutputEncoding = System.Text.Encoding.Unicode; is required to get the console to show the correct characters. I also had to experiment with console fonts to get things to display correctly.