I have a C# endpoint that takes rawText as a string input. The input is sent after converting a file to a string using the third-party Aspose library. The input has the following format, e.g.:

{rawText = "\u0007\u0007\r\r\r\r\r\u0007Random Name\rRandom Address; Overland Park, KS 12345; Cell: 000-000-0000 Email: email1234@gmail.com"}

I know strings are UTF-16 encoded in C#, so when it reaches the endpoint it is converted to:

requestobj.RawText = "\a\a\r\r\r\r\r\aRandom Name\r10504 Random Address; Overland Park, KS 12345; Cell: 000-000-0000 Email: email1234@gmail.com"

Is my reasoning correct that this is due to C# strings being UTF-16 encoded? And what is the best way to remove the \a\a\r\r\r\r\r\a at the beginning of the string? I am passing this text to another third-party API, which does not return the correct result with this extra text prepended.

I have tried the code below, but I want a more generic solution that handles all combinations of \n, \r, \a, etc.

var newText = Regex.Replace(inputValue, @"\\a", "");
inputValue = inputValue.Replace(@"\a", "").Replace(@"\r", "");
s_v
  • The question has nothing to do with *encoding* or Unicode. What you ask about is the escape sequences used to represent characters that are hard to type in source code or debugger output. The escape sequences don't exist in the actual string produced by the compiler, or displayed by the debugger. There's nothing special about them either, *every* character can be represented using an escape sequence. The ones you show are used in *many* programming languages and operating systems. – Panagiotis Kanavos Jul 11 '23 at 07:25
  • The text you show isn't converted to anything. The debugger displays the exact same string using a different escape sequence. Instead of the long form `\u0007` for the Alert character, it shows the short form `\a`. Both represent the same character and the same bytes. – Panagiotis Kanavos Jul 11 '23 at 07:30
  • @PanagiotisKanavos thanks. I solved it by adding this: `Regex.Replace(inputValue, @"[^\u0000-\u007F]", String.Empty);` – s_v Jul 11 '23 at 07:57
  • That range includes all English characters. Did you check the result? – Panagiotis Kanavos Jul 11 '23 at 08:01
  • I had added a ^ – s_v Jul 11 '23 at 08:03
  • That's still all English characters. You tried to replace all characters above the ASCII range. The control characters are *in* that range. Try `Regex.Escape` on the output to see what the actual result contains. [This fiddle](https://dotnetfiddle.net/Rc8ONO) shows that nothing was replaced – Panagiotis Kanavos Jul 11 '23 at 08:05

2 Answers

Those are escape sequences, not UTF-16 encoding. Encoding refers to how characters are converted to bytes. Escape sequences are used to enter characters that are hard to type or invisible in source code. They're also used by debuggers to display such characters. Nothing got converted in the question's case. The same BELL character (0x07) can be represented as either \a or \u0007. The debugger chose the shorter version.
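
As a small illustration of the point above, the sketch below (variable names are only for this example) compares the two spellings of the BELL character, and contrasts them with a verbatim string, which holds a literal backslash instead. That difference is likely why the `Replace(@"\a", "")` attempt in the question had no effect.

```csharp
using System;

// Both literals compile to the same single character, U+0007 (BELL).
string shortForm = "\a";
string longForm = "\u0007";

Console.WriteLine(shortForm == longForm);   // True
Console.WriteLine(shortForm.Length);        // 1
Console.WriteLine((int)shortForm[0]);       // 7

// A verbatim string does not process escapes: this is a backslash followed by 'a'.
string verbatim = @"\a";
Console.WriteLine(verbatim.Length);         // 2
Console.WriteLine(verbatim == shortForm);   // False
```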

To remove just these three characters at the start, you can use the regular expression @"^[\r\n\a]+". Writing the pattern as a verbatim string, which doesn't treat \ as an escape character, avoids having to double the backslashes; the regex engine then interprets \r, \n and \a itself.

var input="\a\a\r\r\r\r\r\aRandom Name\r10504 Random Address; Overland Park, KS 12345; Cell: 000-000-0000 Email: email1234@gmail.com";
var pattern=@"^[\r\n\a]+";
var newText=Regex.Replace(input,pattern,"");

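This produces the following (the \r after "Random Name" is still present, because the pattern only matches at the start of the string):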

Random Name 10504 Random Address; Overland Park, KS 12345; Cell: 000-000-0000 Email: email1234@gmail.com

To remove characters at any position, remove the start anchor ^.

It's also possible to replace all control characters. Control characters have their own Unicode general category, Cc, which a regular expression can match with \p{Cc}.

var pattern=@"\p{Cc}+";
var newText=Regex.Replace(input,pattern,"");

As the docs explain, this category matches any

Control code character, with a Unicode value of U+007F or in the range U+0000 through U+001F or U+0080 through U+009F. Signified by the Unicode designation "Cc" (other, control).
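
The two ideas above can be combined. The following is a minimal sketch, assuming a shortened version of the question's sample input, that contrasts the anchored pattern (leading control characters only) with the unanchored one (control characters anywhere):

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened sample input, assumed for illustration.
string input = "\a\a\r\r\r\r\r\aRandom Name\rRandom Address";

// Anchored: removes only the run of control characters at the start.
string leadingOnly = Regex.Replace(input, @"^\p{Cc}+", "");

// Unanchored: removes every control character, including the \r in the middle.
string everywhere = Regex.Replace(input, @"\p{Cc}+", "");

Console.WriteLine(leadingOnly.Contains('\r'));  // True
Console.WriteLine(everywhere.Contains('\r'));   // False
```

Note that the unanchored version also joins "Random Name" and "Random Address" with nothing in between, so replacing with a space may be preferable if the line breaks carry meaning.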

Panagiotis Kanavos

As Panagiotis pointed out, the representation of escape codes in a string is simply about visual representation and doesn't change the meaning or the encoding of the string. Yes, C# (and .NET in general) uses Unicode/UTF-16 to encode strings in memory, but that's neither relevant to your question nor important in most cases.

That aside, your main question seems to be this:

what is the best way to can I remove the \a\a\r\r\r\r\r\a at string begining.

As with most such questions there are a lot of ways to approach this. Regular expressions (as Panagiotis suggested) can certainly do the job, but they can be finicky and are often slower than more direct options. There are times when a regular expression is the best fit for a particular problem, but this isn't necessarily one of those times. I don't get the impression you're looking for the fastest possible solution... but it doesn't hurt to explore options.

So here are a couple of ideas.

If you're looking to remove a small number of known characters from the start of the string then there's a string method for that: TrimStart(). Specifically the version that accepts a set of characters to remove:

string cleanText = inputText.TrimStart('\a', '\r', '\n');

That's fine for a small number of known characters. But if you're looking to remove any control character from the start of the string you can count them and skip that many characters from the string:

// Count control characters at the start of the string:
int count = 0;
for (; count < inputText.Length && Char.IsControl(inputText, count); count++)
{ }

// This monster is safe:
string cleanText = 
    count == 0 ? inputText : 
    count >= inputText.Length ? string.Empty :
    inputText[count..];

This happens to be one of the fastest methods to do that particular job, but it's not the prettiest. And unless you're doing this frequently you're probably not going to miss a few extra milliseconds each time.

And since performance isn't a critical issue, let me introduce you to one of the slowest options: LINQ.

string cleanText = new string(inputText.SkipWhile(c => char.IsControl(c)).ToArray());

While the performance on this is frankly terrible, it's quite a bit more readable than the high-performance version. SkipWhile() skips items while the condition is met; the rest of the characters are collected into an array and used to create a new string. It's pretty but slow. Just like my cat.
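
As a quick sanity check, here is a sketch that runs the three approaches side by side on a shortened sample input (assumed for illustration). For input that starts only with \a, \r or \n, they should all give the same result:

```csharp
using System;
using System.Linq;

// Shortened sample input, assumed for illustration.
string inputText = "\a\a\r\r\r\r\r\aRandom Name\rRandom Address";

// 1. TrimStart with an explicit set of characters.
string trimmed = inputText.TrimStart('\a', '\r', '\n');

// 2. Count the leading control characters, then slice them off.
int count = 0;
while (count < inputText.Length && char.IsControl(inputText, count))
{
    count++;
}
string sliced = inputText[count..];

// 3. LINQ: skip characters while they are control characters.
string skipped = new string(inputText.SkipWhile(c => char.IsControl(c)).ToArray());

Console.WriteLine(trimmed == sliced && sliced == skipped);  // True
```

The results would differ if the input began with a control character outside the TrimStart set, such as a tab: TrimStart would stop there, while the other two would keep going.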

Corey