How to reliably strip invisible characters that break code?

Question

I am trying to build a bookmarklet and got slammed with this issue, which I was only just able to figure out: a \u8203 character in my block of code, which Chrome unhelpfully reports (upon pasting into the JS console) as "Invalid character ILLEGAL".

Luckily Safari was the one that told me it was a \u8203.

I am editing the code in the Sublime Text 2 editor and somehow copying in and out of it (I also tried TextEdit) fails to remove it.

Is there some sort of website somewhere that will strip all characters other than ASCII?

When I try to save as ISO 8859, the editor saves the file back as UTF-8 "because of unsupported characters".

... Yeah. That's the point. Get rid of my unsupported evil characters.

What am I supposed to do? Edit my file in a hex editor?

FYI I actually solved it by re-typing the code (which originated from this site by the way).

I just did some Googling and found [this](http://www.perlmonks.org/?node_id=619792) and [this](http://stackoverflow.com/questions/1176904/php-how-to-remove-all-non-printable-characters-in-a-string) — Adi, Jul 19 '12 at 05:49
How about something that processes my clipboard. Or a website with a set of text inputs that I can copy/paste with. — Steven Lu, Jul 19 '12 at 22:41
I don't think that's possible with Javascript only (I'm assuming this is what you're using, because of the tag in your question). You can, however, write a small Javascript script with a little help of Flash (I believe there are ready tools for that) that will read the clipboard then do the RegEx replacement then write to the clipboard again. — Adi, Jul 20 '12 at 04:52
I'm sure it's easy to make a loop in js that filters chars in 1-127 ASCII range. — Steven Lu, Jul 20 '12 at 12:58
Wait wait, are we talking about characters in a string? or characters in your code itself, like `if[*] (true){}` where `*` is the invisible char? — Adi, Jul 20 '12 at 13:00
Characters in general. The code I write tends to not require anything outside of ASCII. In fact the only characters I want to keep are the ones accessible on a QWERTY keyboard. Why would I write in a language that I can't type easily? Consider what happened to me: some invisible character (the `\u8203`) got stuck into my file and it follows the code into the clipboard. Including the js file normally is fine but when I paste the same exact code I just copied into the console I get "ILLEGAL CHARACTER OMGWTF" from the browser without a line number. — Steven Lu, Jul 20 '12 at 14:32
"Is there some sort of website somewhere that will strip all characters other than ASCII?" You could use [this website](http://jsfiddle.net/n9PNs/) – Esailija, Jul 21 '12 at 12:44
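
A note on the `\u8203` in the question: 8203 is the decimal code point of the character. In hexadecimal it is U+200B (ZERO WIDTH SPACE), so the JavaScript escape for it is `\u200B`. A minimal sketch to check this in a console, with variable names that are illustrative only:

```javascript
// 8203 is a decimal code point; JavaScript escapes use hexadecimal.
const decimalCodePoint = 8203;
const hex = decimalCodePoint.toString(16);

// The zero-width space written as an escape sequence.
const zeroWidthSpace = "\u200B";

// A string that renders like "ab" but hides the character in the middle.
const sample = "a" + zeroWidthSpace + "b";

console.log(hex);                          // 200b
console.log(zeroWidthSpace.charCodeAt(0)); // 8203
console.log(sample.length);                // 3
```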

Esailija · Answer 1 · 2012-08-08T13:27:22.903

> Is there some sort of website somewhere that will strip all characters other than ASCII?

You could use [this website](http://jsfiddle.net/n9PNs/).

You can recreate the website using this code:

<!DOCTYPE html>
<html>

    <head>
        <meta http-equiv="content-type" content="text/html; charset=UTF-8">
        <title>- jsFiddle demo</title>
        <script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js"></script>
        <link rel="stylesheet" type="text/css" href="/css/normalize.css">
        <link rel="stylesheet" type="text/css" href="/css/result-light.css">
        <style type="text/css">
            textarea {
                width: 800px;
                height: 480px;
                outline: none;
                font-family: Monaco, Consolas, monospace;
                border: 0;
                padding: 15px;
                color: hsl(0, 0%, 27%);
                background-color: #F6F6F6;
            }
        </style>
        <script type="text/javascript">
            //<![CDATA[ 
            $(function () {
                $("button").click(function () {
                    $("textarea").val(
                             $("textarea").val().replace(/[^\u0000-\u007E]/g, "")
                    );
                    $("textarea").focus()[0].select();
                });
            }); //]]>
        </script>
    </head>

    <body>
        <textarea></textarea>
        <button>Remove</button>
    </body>

</html>

Thanks. Short and sweet. Maybe I'll implement this on my website. I'll add in some goodies like a report of which characters were dropped, and where they were. — Steven Lu, Jul 21 '12 at 15:38
moral of the story is don't copy from the Javascript buffer in jsfiddle. They use invisible characters to do strange things in there. — Steven Lu, Jul 26 '12 at 07:30
This is an answer, and it answers the question, but it is just a link that can go dead. Add your code to the answer so it stays useful if that ever happens. – Naftali, Aug 08 '12 at 13:19
Thank you very much for this. On our website, our JS code had the "Â" character on every line. Filtering the code through what you provided removed that character. However, used as is, it killed the formatting of the code, so I replace each bad character with a single space instead, which keeps the formatting. This is my edited version of your project, with a single space: http://jsfiddle.net/carrzkiss/8pwkLxqa/1/ – Wayne Barron, Jun 10 '23 at 07:15
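
The two ideas raised in these comments (reporting which characters were dropped and where, and optionally substituting a space) could be sketched roughly as follows. This is only an illustration: `stripNonAscii` and its `replacement` parameter are hypothetical names, and it treats everything above U+007F as non-ASCII, which is slightly wider than the `\u0000-\u007E` range kept by the answer's regex.

```javascript
// Hypothetical helper: removes every character above U+007F and records
// what was removed and where (1-based line and column, counted in code points).
function stripNonAscii(text, replacement = "") {
  const dropped = [];
  let line = 1;
  let column = 1;
  let cleaned = "";

  // for...of iterates by code point, so astral characters count once.
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    if (codePoint > 0x7f) {
      dropped.push({
        line,
        column,
        codePoint: "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0"),
      });
      cleaned += replacement;
    } else {
      cleaned += ch;
    }
    if (ch === "\n") {
      line += 1;
      column = 1;
    } else {
      column += 1;
    }
  }
  return { cleaned, dropped };
}

const result = stripNonAscii("var a\u200B = 1;\nvar b = 2;");
console.log(result.cleaned); // the two lines of code, without the hidden character
console.log(result.dropped); // one entry: line 1, column 6, U+200B
```

Passing `" "` as the second argument substitutes a space for each dropped character instead of deleting it.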

score 6 · Answer 2 · answered Sep 02 '15 at 22:14

You can use a regex to filter out everything outside the 0-127 range. For example, in JavaScript:

text.replace(/[^\x00-\x7F]/g, "")

Here `\x00` is 0 and `\x7F` is 127.

Matt Kim

This keeps only the ASCII character set, so it removes all non-Western Unicode characters. We only want to remove control characters, not foreign letters. – mike nelson Dec 20 '16 at 19:27
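
One way to address this objection, sketched under the assumption that the runtime supports Unicode property escapes (ES2018 and later), is to remove only characters in the Unicode general categories Cc (control) and Cf (format) while keeping tab, line feed and carriage return. `stripInvisible` and `KEEP` are hypothetical names. Note that Cf also contains characters such as ZERO WIDTH JOINER that are meaningful in some scripts and emoji sequences, and that a no-break space (U+00A0) is not in either category, so this is a starting point rather than a complete answer.

```javascript
// Control characters that source code normally needs: tab, LF, CR.
const KEEP = new Set(["\t", "\n", "\r"]);

// Hypothetical helper: removes control (Cc) and format (Cf) characters,
// leaving letters from any script untouched.
function stripInvisible(text) {
  return text.replace(/[\p{Cc}\p{Cf}]/gu, (ch) => (KEEP.has(ch) ? ch : ""));
}

// The accented letter survives; the zero-width space after it does not.
console.log(stripInvisible("caf\u00E9\u200B = 1;\n"));
```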

Adi · Accepted Answer · 2012-07-19T05:44:26.453

4

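Well, the easiest way I can think of is to use sed:

LC_ALL=C sed -i 's/[^[:print:][:space:]]//g' your_script.js
# Under LC_ALL=C, [:print:] is printable ASCII only; [:space:] keeps tabs.
# sed has no [:ascii:] class (that is Perl/PCRE). On BSD/macOS, use: sed -i ''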

or using tr

tr -cd '\11\12\15\40-\176' < old_script.js > new_script.js

edited Jul 19 '12 at 05:44

answered Jul 19 '12 at 05:32

Will that even match the character (which isn't in the 128-255 ASCII range)? – Steven Lu Jul 19 '12 at 05:35
@StevenLu, alright, think of it as white-listing. You wanna keep ONLY ASCII characters, so you don't really care about `\u8203`. I'll explain further in the answer. – Adi Jul 19 '12 at 05:41
@StevenLu, apparently I made a mistake. I matched the opposite of what you want. Note: you can do the same with a `RegEx` in any language. Do you have a `php` or `perl` installation? – Adi Jul 19 '12 at 05:45
So I agree that `sed` or `tr` is a solid solution, but what about when I am on Windows? – Steven Lu Jul 20 '12 at 14:35
@dda well, yeah. That's what Adnan's original example did – Steven Lu Jul 20 '12 at 14:35
I repeat. ASCII is 0 to 127. ASCII doesn't have codepoints above that. So the expression you used, `in the 128-255 ascii range` doesn't make sense. – dda Jul 20 '12 at 15:20
@dda You are correct. I conflated ASCII with the vague notion of "8 bit characters". – Steven Lu Jul 21 '12 at 15:37
I switched my accept to this answer because depending on a webpage and various copy-paste clipboard shenanigans is slightly asinine. It's really better to keep things simple, and the best way to keep it simple is to have a small command line utility that can be used to scan a source code file for rogue characters. Then one can and *should* use a hex editor to do the dirty business. – Steven Lu Aug 18 '14 at 22:54

score 0 · Answer 4 · answered May 27 '15 at 19:28

Nontechnical solution: paste your text into a new email message in Gmail and click Tx (clear formatting, in the formatting menu). Worked for me.

ERM

You can do the same with something like Notepad.exe. I tend to do this but with Vim (I might paste into Sublime Text, then save as file, then open from Vim) – Steven Lu May 27 '15 at 20:08
