
I'm having trouble removing an unknown bad character from a string. It shows up as a box (indicating a character that can't be displayed in my chosen font).

I have tried several ways of removing it; the most successful was using a regex to remove anything that was not an allowed character. That worked. The issue is that there are many allowed characters, basically anything, and given the wide range of input this will see, I'm unlikely to be able to account for all of them. Performance also needs to be fast (it's basically a scrolling console window).

Is there any other way to process a string to remove these undisplayable characters?

I am using a WPF text box to display the text, with VB.NET as the backend code.

EDIT: Forgot to add that the strings with the special characters cannot be copied to the clipboard from the text box. So I can't put it in another program and identify just what character it is.

Example here:

null null
  • How are these bad characters getting in there? And what makes it a "bad character"? – RBarryYoung Aug 27 '14 at 15:56
  • The bad characters are being added by the device that's providing the data. I have no control over the formatting coming from it. What makes them bad is that my font can't display them, so they show up as either a placeholder or a space, and can't be copied to the clipboard. – null null Aug 27 '14 at 15:57
  • You can identify these characters in your code to find out what they are. `For Each c As Char In MyString...` – Matt Wilko Aug 27 '14 at 15:59
  • Is it possible that your string is ASCII and the characters are utf-8? If so, look at the different answers in this question (it's C#, but the conversion to VB.NET is pretty simple): http://stackoverflow.com/questions/123336/how-can-you-strip-non-ascii-characters-from-a-string-in-c There's also this bit from the MSDN showing how to "clean" a string: http://msdn.microsoft.com/en-us/library/844skk0h(v=vs.110).aspx – valverij Aug 27 '14 at 16:00
  • Do you know *why* the device is adding them? I ask because if this is some kind of modem (which tends to do this if the baud rate or frame sync isn't right), it can also add incorrect but printable characters. The solution then isn't to clean the input, but to fix the device interfacing. – RBarryYoung Aug 27 '14 at 16:01
  • The device is a DSLAM (CO-end DSL modem). I can't say for sure why they are being added, because I don't know exactly what type of character it is. – null null Aug 27 '14 at 16:04
  • If you have the string in VB, what is stopping you from displaying the integer value of each character? What is stopping you from displaying the text in a TextBox you can copy to the clipboard? It could just be control characters. If you can't tell what the character is, then how do you expect to remove it? – paparazzo Aug 27 '14 at 16:28
  • Based on your example, they are surrounding the first letter of a line. They might be special characters that signify something. You could ask the owner of the data why they are sending this information. They could also just be dashes; if you're using the wrong encoding, you won't see the character properly. – the_lotus Aug 27 '14 at 16:58
  • I finally solved it. I used a regex to delete everything in a range of ASCII values and tightened it down until I found out which character it was. It turns out they were ASCII 0 (null) characters. I'm still not sure exactly how they got there. – null null Aug 27 '14 at 17:43

2 Answers


The following regexes will clean your string down to the strict ASCII character set.

// Requires: using System.Text.RegularExpressions; (in WPF, Clipboard lives in System.Windows)
string plainText = Clipboard.GetText(TextDataFormat.Text);

// ASCII is the Basic Latin Unicode block: https://en.wikipedia.org/wiki/Basic_Latin_(Unicode_block)
// or, equivalently, https://en.wikipedia.org/wiki/ASCII
// \uXXXX is the Unicode escape for a character, which makes the first link easy to map onto the patterns below.

// Step 1: remove the "bad" non-printable control characters (\u0000-\u001F and \u0080-\u009F),
// except Horizontal Tab (\u0009), Line Feed (\u000A), and Carriage Return (\u000D).
// No '|' between the ranges: inside a character class it would match a literal pipe.
string asciiText = Regex.Replace(plainText, @"[\u0000-\u0008\u000B\u000C\u000E-\u001F\u0080-\u009F]", string.Empty);

// Step 2: remove everything outside the strict ASCII range, plus the Delete control character (\u007F).
// This runs on the result of step 1 (not on plainText), so both passes take effect.
asciiText = Regex.Replace(asciiText, @"[^\u0000-\u007E]", string.Empty);
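
Since the question is about VB.NET, here is a rough VB.NET sketch of the same idea, folded into a single pattern that keeps only tab, line feed, carriage return, and printable ASCII. The `AsciiCleaner` module, the `CleanToAscii` function, the sample string, and the choice of one shared compiled `Regex` instance are assumptions made for illustration, not part of the answer above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AsciiCleaner
    ' Matches anything that is NOT tab, line feed, carriage return,
    ' or printable ASCII (U+0020-U+007E). Built once and reused.
    Private ReadOnly NonAscii As New Regex("[^\u0009\u000A\u000D\u0020-\u007E]", RegexOptions.Compiled)

    ' Hypothetical helper: returns the input with every non-matching character removed.
    Function CleanToAscii(source As String) As String
        Return NonAscii.Replace(source, String.Empty)
    End Function

    Sub Main()
        ' Made-up sample: a null, an accented letter, a tab, and a Delete character.
        Dim raw As String = ChrW(0) & "caf" & ChrW(&HE9) & vbTab & "ok" & ChrW(&H7F)
        Console.WriteLine(CleanToAscii(raw))
    End Sub
End Module
```

Reusing one `Regex` instance avoids re-parsing the pattern on every call, which may help when the text box is updated frequently.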
Paul Roub
Markus

It turns out my issue was ASCII 0 (null) characters in my strings. The trouble was that the `Asc` function didn't seem to want to print them at all. I tracked them down with a regex like [\x00-\x07], using a regex replace to replace all matching characters with an empty string. I then narrowed the range until I found the offending character, and now replace only that.
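
Another way to find the offending character, without depending on the display font or the clipboard, is to print each character's code point. The following is only a sketch: the `CharDump` module, the `DescribeControlChars` helper, and the sample string are made up for this example.

```vbnet
Imports System
Imports System.Text

Module CharDump
    ' Hypothetical helper: lists every control character in the input
    ' together with its index and hexadecimal code point.
    Function DescribeControlChars(source As String) As String
        Dim sb As New StringBuilder()
        For i As Integer = 0 To source.Length - 1
            Dim c As Char = source(i)
            If Char.IsControl(c) Then
                ' AscW gives the numeric UTF-16 value, which can always be
                ' printed even when the character itself cannot be displayed.
                sb.AppendLine(String.Format("index {0}: U+{1:X4}", i, AscW(c)))
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' Made-up sample with null characters around the first letter.
        Dim sample As String = ChrW(0) & "A" & ChrW(0) & "DSL line 1"
        Console.Write(DescribeControlChars(sample))
    End Sub
End Module
```

Note that `Char.IsControl` also reports tabs, carriage returns, and line feeds, so expect those in the output for multi-line input.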

I encourage anyone with a similar problem to consider using a regex to match a range of characters and then narrowing that range down in the same way.
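
Once the culprit is known to be the null character, a regex is no longer strictly needed. As a minimal sketch (the `NullStripper` module, the `StripNulls` name, and the sample text are hypothetical), a plain `String.Replace` call removes it:

```vbnet
Imports System

Module NullStripper
    ' Hypothetical helper: removes every U+0000 character from the input.
    Function StripNulls(source As String) As String
        ' vbNullChar is the VB.NET constant for a one-character string containing U+0000.
        Return source.Replace(vbNullChar, String.Empty)
    End Function

    Sub Main()
        ' Made-up sample with null characters around the first letter.
        Dim raw As String = vbNullChar & "A" & vbNullChar & "DSL line 1"
        Console.WriteLine(StripNulls(raw))
    End Sub
End Module
```

A plain string replacement skips the regex engine entirely, which may be worth it for a fast-scrolling console view.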

null null