How can two visually identical bits of text be different to the clipboard?

Question

I have a Sublime Text document with two identical file paths on two separate lines. If I copy one, my app functionality works; if I copy the other, it does not.

When I select one line and press cmd + d, you would expect Sublime to highlight both lines, as per normal functionality. It does not. This is also true in VS Code, so something is different about these two lines.

I have tried myData.toString(). I also tried JSON.parse, but it didn't go well and I couldn't figure it out.

Here are the offending lines.

/Volumes/Macintosh HD/Archive/Work/AE_Scripting/⁨Resources⁩/⁨CEP-Resources-master⁩/⁨CEP_8.x⁩/⁨Documentation

-Works
/Volumes/Macintosh HD/Archive/Work/AE_Scripting/Resources/CEP-Resources-master/CEP_8.x/Documentation

After uploading an example file for this post I now have some new information, as you can see here:

http://gravitystaging.com/uploadarea/test/examplefile.txt

Both lines now appear as

/Volumes/Macintosh HD/Archive/Work/AE_Scripting/â¨Resourcesâ©/â¨CEP-Resources-masterâ©/â¨CEP_8.xâ©/â¨Documentation

-Works
/Volumes/Macintosh HD/Archive/Work/AE_Scripting/Resources/CEP-Resources-master/CEP_8.x/Documentation

In any editor, though, they look normal and identical. So how can I process this string to remove these characters?

Looks like the line was encoded in a different format than the second line. Have you edited the file with different operating systems (e.g. Linux and Windows)? — Alberti Buonarroti, Feb 16 '19 at 22:07
I'm not too sure where the first line came from possibly terminal — Wiplash, Feb 16 '19 at 22:08

score 1 · Answer 1 · answered Feb 16 '19 at 22:27

Your first string has some Unicode bidirectional marking characters in it: U+2068 and U+2069. You can use the ord function in Python to check for these:

>>> [ord(x) for x in '/Volumes/Macintosh HD/Archive/Work/AE_Scripting/⁨Resources⁩/⁨CEP-Resources-master⁩/⁨CEP_8.x⁩/⁨Documentation']
[47, 86, 111, 108, 117, 109, 101, 115, 47, 77, 97, 99, 105, 110, 116, 111, 115, 104, 32, 72, 68, 47, 65, 114, 99, 104, 105, 118, 101, 47, 87, 111, 114, 107, 47, 65, 69, 95, 83, 99, 114, 105, 112, 116, 105, 110, 103, 47, 8296, 82, 101, 115, 111, 117, 114, 99, 101, 115, 8297, 47, 8296, 67, 69, 80, 45, 82, 101, 115, 111, 117, 114, 99, 101, 115, 45, 109, 97, 115, 116, 101, 114, 8297, 47, 8296, 67, 69, 80, 95, 56, 46, 120, 8297, 47, 8296, 68, 111, 99, 117, 109, 101, 110, 116, 97, 116, 105, 111, 110]

See the ones that are 8000-something? Those are the Unicode markers you don't want.

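If you just want to drop everything outside the Latin-1 range (plain ASCII plus the accented characters up to U+00FF), here's how I would do that in Python:

''.join(c for c in my_string if ord(c) < 256)

This strips out anything higher than U+00FF. For strict ASCII, compare against 128 instead.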
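
Since the question is about JavaScript, the same inspection can be sketched there. This is a minimal example, not taken from any answer in this thread; the function name findNonAscii is made up for illustration. It lists every character above the ASCII range together with its position, which is usually enough to spot invisible characters in a pasted path.

```javascript
// Hypothetical helper: report every non-ASCII character in a string.
function findNonAscii(text) {
  const hits = [];
  let index = 0;
  // for...of walks the string by code point, so astral characters stay whole.
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp > 0x7f) {
      const hex = cp.toString(16).toUpperCase().padStart(4, '0');
      hits.push({ index: index, codePoint: 'U+' + hex });
    }
    index += ch.length; // advance in UTF-16 code units
  }
  return hits;
}

// Example: a path segment wrapped in U+2068 / U+2069.
console.log(findNonAscii('/\u2068Resources\u2069/'));
```

An empty result means the string is pure ASCII; anything else tells you which code points to look up.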

score 0 · Answer 2 · answered Feb 16 '19 at 22:08

I'd recommend taking a look at using regex to remove all non-alphanumeric characters.

See https://stackoverflow.com/a/7225734/9899022

Since the pasted text and the additional characters are already a string, attempting to parse it as JSON or calling .toString() won't change anything about the variable.

Ryan Fleck


Thank you. It looks like regex is the answer. It's going to take a while to decipher the voodoo. – Wiplash Feb 16 '19 at 22:21

Alberti Buonarroti · Answer 3 · 2019-02-16T22:51:48.023

If you cat your file in a (macOS) bash terminal you will get identical lines. Running encguess examplefile.txt will tell you the format is UTF-8. Opening it in Sublime Text 3 with UTF-8 encoding will also show you identical lines.

But if you switch to Western (Windows 1252) encoding then you will get the exact same wrong symbols as in your example. So I guess you are using the wrong encoding to view your file.

How to switch encoding in SublimeText 3: File => Reopen With Encoding => Choose your Encoding (UTF-8)
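
To see where those symbols come from, here is a small sketch (assuming Node.js, where Buffer is available globally; the helper names are made up for illustration). It encodes the invisible characters as UTF-8 and then misreads the bytes one byte per character. Latin-1 is used for the misreading because Node supports it natively; it matches Windows 1252 for the E2, A8 and A9 bytes, while byte 81 has no printable character in either encoding.

```javascript
// Hypothetical helper: show the UTF-8 bytes of a string as hex.
function utf8Hex(text) {
  return Array.from(Buffer.from(text, 'utf8'))
    .map(function (b) { return b.toString(16).padStart(2, '0'); })
    .join(' ');
}

// Hypothetical helper: encode as UTF-8, then decode as if it were Latin-1.
function misreadUtf8AsLatin1(text) {
  return Buffer.from(text, 'utf8').toString('latin1');
}

console.log(utf8Hex('\u2068')); // e2 81 a8
console.log(utf8Hex('\u2069')); // e2 81 a9

// Three characters come back: U+00E2 (â), U+0081 (invisible), U+00A9 (©).
console.log(JSON.stringify(misreadUtf8AsLatin1('\u2069')));
```

Each invisible character is three bytes in UTF-8, which is why every one of them turns into a short run of odd symbols when the file is read with a single-byte encoding.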

Edit
If you want to remove the wrong characters from your given string, you can use String.replace().

str = "/Volumes/Macintosh HD/Archive/Work/AE_Scripting/â¨Resourcesâ©/â¨CEP-Resources-masterâ©/â¨CEP_8.xâ©/â¨Documentation"

console.log("Before: ", str);

str = str.replace(/(â©)|(â¨)/g, "");
console.log("After: ", str);

edited Feb 16 '19 at 22:51

answered Feb 16 '19 at 22:19

Alberti Buonarroti


Thank you for the explanation, but I need to know how to convert the line with javascript. – Wiplash Feb 16 '19 at 22:20
Do you want to remove all the wrong chars using JavaScript? You could use a simple Regex to delete the unwanted characters `str.replace(/(â©)|(â¨)/g, "")` – Alberti Buonarroti Feb 16 '19 at 22:28
This is incorrect: the correct encoding is UTF-8, which makes the control characters invisible (as they should be). Interpreting it as Windows 1252 turns them into the mojibake. – Draconis Feb 16 '19 at 22:36
Thanks for the insight, updated my answer. Also, thanks for "mojibake" didn't know there is a name for it. It makes it easier to research! – Alberti Buonarroti Feb 16 '19 at 22:39
That `str.replace()` regex doesn't actually work, because I can't guarantee the encoding of the text coming in. The string is set by Electron's clipboard.readText(), so the regex doesn't see these characters. – Wiplash Feb 19 '19 at 22:04
I just answered your specific problem: removing these characters. You can easily create a regex that only allows valid characters. – Alberti Buonarroti Feb 21 '19 at 08:55

score 0 · Accepted Answer · answered Feb 22 '19 at 09:22

I managed to solve this with the help of the following thread:

How to remove invalid UTF-8 characters from a JavaScript string?

function cleanString(input) {
    var output = "";
    for (var i=0; i<input.length; i++) {
        if (input.charCodeAt(i) <= 127) {
            output += input.charAt(i);
        }
    }
    return output;
}

It's something I looked at early on, but I must have been using it incorrectly.
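
Note that cleanString drops every non-ASCII character, so a path containing accented letters would be altered too. If that matters, a narrower alternative is to strip only the Unicode bidirectional control characters, which include the U+2068 and U+2069 found in the failing line. The sketch below is an assumption about what would suit this case, not something from the linked thread, and the name stripBidiControls is made up for illustration.

```javascript
// Bidirectional control characters:
//   U+061C         Arabic letter mark
//   U+200E, U+200F left-to-right / right-to-left marks
//   U+202A-U+202E  embeddings and overrides
//   U+2066-U+2069  isolates (the range seen in the question)
var BIDI_CONTROLS = /[\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]/g;

// Hypothetical helper: remove only the invisible bidi controls.
function stripBidiControls(input) {
  return input.replace(BIDI_CONTROLS, '');
}

var pasted = '/AE_Scripting/\u2068Resources\u2069/\u2068Documentation';
console.log(stripBidiControls(pasted)); // /AE_Scripting/Resources/Documentation
```

Because the pattern uses \u escapes rather than pasted characters, it does not depend on how the source file itself is encoded.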
