Match non printable/non ascii characters and remove from text

Question

My JavaScript is quite rusty, so any help with this would be great. I need to detect non-printable characters (control characters like SOH, BS etc.) as well as extended ASCII characters such as Ž in a string and remove them, but I am not sure how to write the code.

Can anyone point me in the right direction for how to go about this? This is what I have so far:

$(document).ready(function() {
    $('.jsTextArea').blur(function() {
        var pattern = /[^\000-\031]+/gi;
        var val = $(this).val();
        if (pattern.test(val)) {
            for (var i = 0; i < val.length; i++) {
                var res = val.charAt(i);
                alert("Character " + i + " " + res);
            }
        }
        else {
            alert("It failed");
        }
    });
});

The `match` property should be called like so: `isNonAscii.match($(this).val())`. The program does not magically know that you want to match the value of the input with the regex. — SeinopSys, Jun 15 '14 at 11:59
Thanks for the input. Makes sense, but how I remove the invalid character that is detected from the string in the textbox? — Grant Doole, Jun 15 '14 at 12:08
I have decided to change my approach to this and go for a server side solution (since javascript can sometimes be turned off in the clients browser) — Grant Doole, Jun 25 '14 at 17:49
@GrantDoole: Don't invalidate existing answers by completely changing the code of your question. — Cerbrus, Apr 10 '17 at 06:47

score 91 · Accepted Answer · edited Dec 08 '18 at 07:56

To target characters that are not part of the printable basic ASCII range, you can use this simple regex:

[^ -~]+

Explanation: in the first 128 characters of the ASCII table, the printable range starts with the space character and ends with a tilde. These are the characters you want to keep. That range is expressed with [ -~], and the characters not in that range are expressed with [^ -~]. These are the ones we want to replace. Therefore:

result = string.replace(/[^ -~]+/g, "");
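A minimal sketch of that replacement, together with a variant for multiline text that also keeps tab, line feed and carriage return. The function names `stripNonPrintableAscii` and `stripKeepingWhitespace` are illustrative and not part of the answer. Note that DEL (0x7F) sorts after the tilde (0x7E), so both versions remove it.

```javascript
// Remove everything outside the printable ASCII range (space through tilde).
function stripNonPrintableAscii(text) {
    return text.replace(/[^ -~]+/g, "");
}

// Variant for multiline text: additionally keep tab, line feed and carriage return.
function stripKeepingWhitespace(text) {
    return text.replace(/[^ -~\t\n\r]+/g, "");
}

// "é", "Ž" and the SOH control character are all outside the space-to-tilde range.
console.log(stripNonPrintableAscii("caf\u00e9 \u017Dest\x01")); // "caf est"

// BS (\x08) is removed, while the line break and the tab survive.
console.log(JSON.stringify(stripKeepingWhitespace("line 1\r\nline\t2\x08")));
```

The first function also removes accented and non-Latin letters, which is what the question asks for but may not suit text in other languages.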

edited Dec 08 '18 at 07:56

Jonathan

answered Jun 15 '14 at 15:46

zx81

Hi, very good answers all round, and I am close to resolving this. While value.replace works very well, it's not exactly what I need. I will update the original post with what I have so far. – Grant Doole Jun 17 '14 at 20:13
This will, notably, replace newline/carriage return, so it won't work for multiline text. – Jonathan Dec 08 '18 at 07:59
2019 and this is still the most elegant solution I've yet encountered for this. Yes, it removed the newline, carriage return and tab characters, but for those actually trying to strip those, this solution is gorgeous and easily human readable. – JRad the Bad Apr 07 '19 at 23:51
Hi. FYI: This will not work for special characters like "şıç" (Turkish). Will replace them and break the word. – Canser Yanbakan Dec 30 '19 at 09:49
Yes the same for Korean characters too. – fraserh Jan 11 '22 at 16:07
This regex is missing delete, which is the last ASCII character, is non-printable, and is a control character. – Zamicol May 25 '22 at 17:36
@Zamicol Are you sure? `DEL` comes after `~`, and this regex matches everything outside of the range from `Space` to `~`. – Arthur Khazbs Feb 26 '23 at 22:47
If you need to keep tabs and newlines, this can be extended to `[^( -~)\n\r\t]+` – Jesse Mar 10 '23 at 16:09

score 41 · Answer 2 · edited Jun 15 '14 at 14:10

No need to test, you can directly process the text box content:

textBoxContent = textBoxContent.replace(/[^\x20-\x7E]+/g, '');

where the range \x20-\x7E covers the printable part of the ascii table.

Example with your code:

$('.jsTextArea').blur(function() {
    this.value = this.value.replace(/[^\x20-\x7E]+/g, '');
});
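As a small check that `replace` also finds invisible characters, the sketch below builds a string containing SOH, ACK and BS with `String.fromCharCode` and strips them. The variable names are illustrative. It also shows why the `g` flag matters: without it, only the first match is removed.

```javascript
// Visible letters separated by SOH (1), ACK (6) and BS (8).
const dirty = "A" + String.fromCharCode(1) +
              "B" + String.fromCharCode(6) +
              "C" + String.fromCharCode(8) + "D";

// With the g flag every run of non-printable characters is removed.
const clean = dirty.replace(/[^\x20-\x7E]+/g, "");

console.log(dirty.length); // 7
console.log(clean);        // "ABCD"

// Without the g flag only the first match (SOH) is removed.
const firstOnly = dirty.replace(/[^\x20-\x7E]+/, "");
console.log(firstOnly.length); // 6
```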

edited Jun 15 '14 at 14:10

answered Jun 15 '14 at 12:23

Casimir et Hippolyte

Thanks for the input but that won't work as the replace function only works with printable characters. The control characters such as BS, SOH, ACK etc are invisible and thus are not picked up with the .replace method. – Grant Doole Jun 15 '14 at 12:30
@GrantDoole: What a crazy idea! Just because a character is not printable doesn't mean that the replace method will not find it! The replace method works with any character (printable or not). – Casimir et Hippolyte Jun 15 '14 at 12:44
Really? That's strange because I just tested it and it didn't work? Are you able to show me? – Grant Doole Jun 15 '14 at 12:49
@GrantDoole: I will add a small test to my answer. – Casimir et Hippolyte Jun 15 '14 at 12:50
@GrantDoole: I have forgotten to put the g modifier, It is probably why you didn't obtain the expected result. – Casimir et Hippolyte Jun 15 '14 at 13:00
Still can't get it to work... Your solution worked great, but when I tried copying and pasting my text with special characters and triggered the blur event, nothing happens and the char is not removed. What am I doing wrong? – Grant Doole Jun 15 '14 at 13:31
This doesn't work for multiline text because it will replace linefeed carriage return =/ – Jonathan Dec 08 '18 at 07:57
Warning to anyone using this answer: This will strip out printable Unicode characters, basically anything above decimal 126 such as the degree or Yen symbols. – Ben Jun 15 '23 at 18:42
@Ben: the degree and Yen symbols aren't ascii characters. Read the title of the question. – Casimir et Hippolyte Jun 16 '23 at 09:22
@CasimiretHippolyte, Yes, I knew that when I left the remark. My *warning* is for people who grab this solution and don't consider that there are some common non-ascii characters that will be removed. Keep in mind I did not paste a *solution* but just left a remark. Myself, I try to read through the remarks before using a solution in case there are edge cases that I hadn't considered. I'm just trying to pay it forward. – Ben Jun 27 '23 at 14:31

score 4 · Answer 3 · edited Nov 18 '22 at 14:42

For anyone looking for a solution that works beyond ascii and does not strip out Unicode chars:

function stripNonPrintableAndNormalize(text) {
    // normalize newlines first, because \p{C} below also matches \r and \n
    text = text.replace(/\r\n/g, '\n');
    text = text.replace(/\p{Zl}/gu, '\n');
    text = text.replace(/\p{Zp}/gu, '\n');

    // strip control chars (Unicode category C), but keep \n
    text = text.replace(/[^\P{C}\n]/gu, '');

    // another common task is to normalize other whitespace
    text = text.replace(/\p{Zs}/gu, ' ');

    return text;
}
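To see which characters these property escapes affect, here is a small sketch that reports the categories a single character belongs to. The helper `categoriesOf` is hypothetical and only for illustration. It assumes an engine with Unicode property escapes (ES2018 or later). Which code points count as unassigned (Cn, part of C) depends on the Unicode version the engine ships with.

```javascript
// Hypothetical helper: list which of a few general categories match one character.
function categoriesOf(ch) {
    const checks = {
        C: /^\p{C}$/u,   // control, format, surrogate, private use, unassigned
        Zs: /^\p{Zs}$/u, // space separators
        Zl: /^\p{Zl}$/u, // U+2028 LINE SEPARATOR
        Zp: /^\p{Zp}$/u, // U+2029 PARAGRAPH SEPARATOR
        L: /^\p{L}$/u    // letters of any script
    };
    return Object.keys(checks).filter(function (name) {
        return checks[name].test(ch);
    });
}

console.log(categoriesOf("\u015F").join(","));  // "L"  (Turkish s with cedilla)
console.log(categoriesOf("\uD55C").join(","));  // "L"  (a Hangul syllable)
console.log(categoriesOf("\x01").join(","));    // "C"  (SOH)
console.log(categoriesOf("\u200B").join(","));  // "C"  (zero width space, a format character)
console.log(categoriesOf("\n").join(","));      // "C"  (line feed is a control character too)
console.log(categoriesOf("\u00A0").join(","));  // "Zs" (no-break space)
```

Letters outside ASCII are left alone by `\p{C}`, while line feed, carriage return and tab are themselves control characters and are matched by it.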

The various unicode class identifiers (e.g. Zl for line separator) are defined at https://www.unicode.org/reports/tr44/#GC_Values_Table as also shown below:

Abbr	Long	Description
Lu	Uppercase_Letter	an uppercase letter
Ll	Lowercase_Letter	a lowercase letter
Lt	Titlecase_Letter	a digraphic character, with first part uppercase
LC	Cased_Letter	Lu | Ll | Lt
Lm	Modifier_Letter	a modifier letter
Lo	Other_Letter	other letters, including syllables and ideographs
L	Letter	Lu | Ll | Lt | Lm | Lo
Mn	Nonspacing_Mark	a nonspacing combining mark (zero advance width)
Mc	Spacing_Mark	a spacing combining mark (positive advance width)
Me	Enclosing_Mark	an enclosing combining mark
M	Mark	Mn | Mc | Me
Nd	Decimal_Number	a decimal digit
Nl	Letter_Number	a letterlike numeric character
No	Other_Number	a numeric character of other type
N	Number	Nd | Nl | No
Pc	Connector_Punctuation	a connecting punctuation mark, like a tie
Pd	Dash_Punctuation	a dash or hyphen punctuation mark
Ps	Open_Punctuation	an opening punctuation mark (of a pair)
Pe	Close_Punctuation	a closing punctuation mark (of a pair)
Pi	Initial_Punctuation	an initial quotation mark
Pf	Final_Punctuation	a final quotation mark
Po	Other_Punctuation	a punctuation mark of other type
P	Punctuation	Pc | Pd | Ps | Pe | Pi | Pf | Po
Sm	Math_Symbol	a symbol of mathematical use
Sc	Currency_Symbol	a currency sign
Sk	Modifier_Symbol	a non-letterlike modifier symbol
So	Other_Symbol	a symbol of other type
S	Symbol	Sm | Sc | Sk | So
Zs	Space_Separator	a space character (of various non-zero widths)
Zl	Line_Separator	U+2028 LINE SEPARATOR only
Zp	Paragraph_Separator	U+2029 PARAGRAPH SEPARATOR only
Z	Separator	Zs | Zl | Zp
Cc	Control	a C0 or C1 control code
Cf	Format	a format control character
Cs	Surrogate	a surrogate code point
Co	Private_Use	a private-use character
Cn	Unassigned	a reserved unassigned code point or a noncharacter
C	Other	Cc | Cf | Cs | Co | Cn

score 1 · Answer 4 · answered Jun 15 '14 at 11:55

You have to assign a pattern (instead of a string) to the isNonAscii variable, then use test() to check whether it matches. test() returns true or false.

$(document).ready(function() {
    $('.jsTextArea').blur(function() {
        var pattern = /[^\000-\031]+/gi;
        var val = $(this).val();
        if (pattern.test(val)) {
            alert("It matched");
        }
        else {
            alert("It did NOT match");
        }
    });
});
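Two caveats about the pattern in this snippet, sketched below. In a regex literal without the `u` flag, `\031` is read as an octal escape (decimal 25), so the class does not reach character 31, and the leading `^` negates the class, so it matches ordinary text instead of control characters. A regex with the `g` flag also keeps its `lastIndex` between `test()` calls. The helper `hasNonPrintable` is a hypothetical name for a detection-only check.

```javascript
// \031 is octal for 25, so US (\x1F, decimal 31) is outside [\000-\031].
console.log(/[\000-\031]/.test("\x1F")); // false
console.log(/[\x00-\x1F]/.test("\x1F")); // true

// The negated class matches any character that is not in the range, such as letters.
console.log(/[^\000-\031]+/.test("hello")); // true

// A global regex is stateful: the second call resumes after the previous match.
var stateful = /[^\x20-\x7E]/g;
var first = stateful.test("a\x01");  // true, lastIndex is now 2
var second = stateful.test("a\x01"); // false, the search started at index 2
console.log(first, second);

// Hypothetical detection helper: no g flag, hex escapes for the printable range.
function hasNonPrintable(text) {
    return /[^\x20-\x7E]/.test(text);
}

console.log(hasNonPrintable("plain text")); // false
console.log(hasNonPrintable("bad\x08"));    // true
```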

Check jsFiddle

answered Jun 15 '14 at 11:55

kosmos

Many thanks for the response, but how can I detect the invalid char, remove it from the string and put the new string without the invalid char back in the textbox? – Grant Doole Jun 15 '14 at 12:02
Using the `replace()` function should work as expected. You can do it directly instead of that piece of code. @CasimiretHippolyte's code works fine – kosmos Jun 15 '14 at 12:54

score -5 · Answer 5 · answered Jul 12 '14 at 10:47

For those who have this problem and are looking for a 'fix all' solution... This is how I eventually fixed it:

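// Requires: using System.Text.RegularExpressions;
public static string RemoveTroublesomeCharacters(string inString)
{
    if (inString == null)
    {
        return null;
    }

    // Walk a snapshot of the input: inString shrinks as characters are removed,
    // so indexing into it directly would skip the character after each removal.
    string original = inString;
    for (int i = 0; i < original.Length; i++)
    {
        char ch = original[i];
        if (char.IsControl(ch))
        {
            // Removes every occurrence of this control character.
            string matchedChar = ch.ToString();
            inString = inString.Replace(matchedChar, string.Empty);
        }
    }

    // Matches any character outside the 7-bit ASCII range.
    Regex regex = new Regex(@"[^\u0000-\u007F]");
    Match charMatch = regex.Match(inString);
    while (charMatch.Success)
    {
        string matchedChar = charMatch.ToString();
        inString = inString.Replace(matchedChar, string.Empty);
        charMatch = charMatch.NextMatch();
    }

    return inString;
}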

I'll break it down in a bit more detail for those less experienced:

We first loop through every character of the string and use the IsControl method of char to determine whether the character is a control character.
If a control character is found, we copy it to a string and use the Replace method to replace it with an empty string. Rinse and repeat for the rest of the string.
Once we have looped through the entire string, we use the regex (which matches any character outside the standard 7-bit ASCII range) and again replace each matched character with an empty string. Doing this in a while loop means the replacement continues for as long as charMatch.Success is true.
Finally, once all such characters have been removed, we return inString.

(Note: I have still not yet managed to figure out how to repopulate the TextBox with the new modified inString value, so if anyone can point out how it can be done that would be great)

You have perfectly valid answers here and your solution is based on them. Also \u0000-\u0020 are control characters. — Zlatin Zlatev, Oct 17 '16 at 16:08
And who says that he shouldn't outline how he used the other answers (one of which is marked as accepted) to finally solve his problem? One could argue that the purpose of SO is to achieve such an outcome. — m12lrpv, Aug 24 '22 at 23:00
