
I have a Sublime Text document with two seemingly identical file paths (on two separate lines). If I copy one, my app's functionality works; if I copy the other, it does not.

When I select one line and press Cmd+D, I would expect Sublime to highlight both lines, as it normally does. It does not. The same is true in VS Code, so something is different about these two lines.

I have tried myData.toString() and JSON.parse, but neither helped and I couldn't figure it out.

Here are the offending lines. The first one does not work:

/Volumes/Macintosh HD/Archive/Work/AE_Scripting/⁨Resources⁩/⁨CEP-Resources-master⁩/⁨CEP_8.x⁩/⁨Documentation

-Works
/Volumes/Macintosh HD/Archive/Work/AE_Scripting/Resources/CEP-Resources-master/CEP_8.x/Documentation

After uploading an example file for this post, I now have some new information, as you can see here:

http://gravitystaging.com/uploadarea/test/examplefile.txt

Both lines now appear as

/Volumes/Macintosh HD/Archive/Work/AE_Scripting/â¨Resourcesâ©/â¨CEP-Resources-masterâ©/â¨CEP_8.xâ©/â¨Documentation

-Works
/Volumes/Macintosh HD/Archive/Work/AE_Scripting/Resources/CEP-Resources-master/CEP_8.x/Documentation

In any editor, though, they look normal and identical. How can I process this string to remove these characters?

Wiplash

4 Answers


Your first string has some Unicode bidirectional formatting characters in it: U+2068 (FIRST STRONG ISOLATE) and U+2069 (POP DIRECTIONAL ISOLATE). You can use the ord function in Python to check for these:

>>> [ord(x) for x in '/Volumes/Macintosh HD/Archive/Work/AE_Scripting/⁨Resources⁩/⁨CEP-Resources-master⁩/⁨CEP_8.x⁩/⁨Documentation']
[47, 86, 111, 108, 117, 109, 101, 115, 47, 77, 97, 99, 105, 110, 116, 111, 115, 104, 32, 72, 68, 47, 65, 114, 99, 104, 105, 118, 101, 47, 87, 111, 114, 107, 47, 65, 69, 95, 83, 99, 114, 105, 112, 116, 105, 110, 103, 47, 8296, 82, 101, 115, 111, 117, 114, 99, 101, 115, 8297, 47, 8296, 67, 69, 80, 45, 82, 101, 115, 111, 117, 114, 99, 101, 115, 45, 109, 97, 115, 116, 101, 114, 8297, 47, 8296, 67, 69, 80, 95, 56, 46, 120, 8297, 47, 8296, 68, 111, 99, 117, 109, 101, 110, 116, 97, 116, 105, 111, 110]

See the values 8296 and 8297 (0x2068 and 0x2069)? Those are the Unicode markers you don't want.
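Since the question is about JavaScript, the same check can be sketched there. This is only a minimal sketch and not part of the original answer: the helper name `findNonAscii` and the shortened sample path are made up for illustration. It walks the string by code point and reports everything above U+007F along with its position.

```javascript
// Hypothetical helper (not from the original answer): list every
// non-ASCII code point in a string together with its position.
function findNonAscii(text) {
  const hits = [];
  let index = 0; // position measured in UTF-16 code units
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    if (codePoint > 0x7f) {
      hits.push({
        index,
        codePoint: 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0'),
      });
    }
    index += ch.length;
  }
  return hits;
}

// The escapes \u2068 and \u2069 stand in for the invisible characters.
const sample = '/AE_Scripting/\u2068Resources\u2069/Documentation';
console.log(findNonAscii(sample));
```

Running something like this on the pasted text should make the invisible characters show up as explicit U+XXXX entries, which is easier to read than a long list of decimal values.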

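If you just want to drop everything outside the Latin-1 range (which includes plain ASCII), here's how I would do that in Python:

''.join(c for c in my_string if ord(c) < 256)

This strips out anything higher than U+00FF. For strict ASCII, use ord(c) < 128 instead.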

Draconis

I'd recommend using a regex to strip out the characters you don't want, for example all non-alphanumeric characters.

See https://stackoverflow.com/a/7225734/9899022

Since the pasted text, including the additional characters, is already a string, parsing it as JSON or calling .toString() won't change anything about the variable.
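As a rough illustration of the regex approach, here is a minimal sketch that is not taken from the linked answer. Removing every non-alphanumeric character would also remove the slashes, spaces, and underscores a path needs, so this version targets only the Unicode bidirectional formatting characters. The names `BIDI_CONTROLS` and `stripBidiControls` and the shortened sample path are assumptions made for the example.

```javascript
// Sketch: remove Unicode bidirectional formatting characters only,
// leaving slashes, spaces, and other legitimate characters untouched.
// U+061C, U+200E-U+200F: directional marks
// U+202A-U+202E: embeddings and overrides
// U+2066-U+2069: isolates (the question's string contains U+2068 and U+2069)
const BIDI_CONTROLS = /[\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]/g;

// Hypothetical helper name, chosen for this example.
function stripBidiControls(text) {
  return text.replace(BIDI_CONTROLS, '');
}

// The escapes stand in for the invisible characters in the pasted path.
const pasted = '/AE_Scripting/\u2068Resources\u2069/\u2068CEP_8.x\u2069/\u2068Documentation';
console.log(stripBidiControls(pasted));
```

Because the pattern uses `\u` escapes, it matches the actual code points in the string and does not depend on how an editor or browser happens to display them.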

Ryan Fleck
  • Thank you. It looks like regex is the answer. It's going to take a while to decipher the voodoo. – Wiplash Feb 16 '19 at 22:21

If you cat your file in a (macOS) bash terminal, you will see identical lines. Running encguess examplefile.txt will tell you the encoding is UTF-8. Opening it in Sublime Text 3 with UTF-8 encoding will also show you identical lines.

But if you switch to Western (Windows 1252) encoding then you will get the exact same wrong symbols as in your example. So I guess you are using the wrong encoding to view your file.
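The effect can be reproduced in Node.js with a short sketch. It assumes Node's built-in `Buffer`, and it uses the 'latin1' decoding as a stand-in for Windows-1252: the two agree on the bytes 0xE2, 0xA8, and 0xA9, which are the visible ones here. The function name `misreadUtf8AsLatin1` is made up for the example.

```javascript
// Sketch (Node.js): show how the invisible U+2068 / U+2069 become visible
// garbage when their UTF-8 bytes are decoded one byte per character.
// Assumption: 'latin1' stands in for Windows-1252 (see the note above).
function misreadUtf8AsLatin1(text) {
  return Buffer.from(text, 'utf8').toString('latin1');
}

// U+2068 is encoded in UTF-8 as the three bytes e2 81 a8.
console.log(Buffer.from('\u2068', 'utf8'));

// JSON.stringify makes the otherwise invisible middle byte (0x81) visible.
console.log(JSON.stringify(misreadUtf8AsLatin1('\u2068Resources\u2069')));
```

The first byte of each character decodes to "â" and the last to "¨" or "©", which matches the symbols seen in the uploaded file.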

How to switch encoding in SublimeText 3: File => Reopen With Encoding => Choose your Encoding (UTF-8)

Edit
If you want to remove the wrong characters from your given string, you can use String.replace().

str = "/Volumes/Macintosh HD/Archive/Work/AE_Scripting/â¨Resourcesâ©/â¨CEP-Resources-masterâ©/â¨CEP_8.xâ©/â¨Documentation"

console.log("Before: ", str);

str = str.replace(/(â©)|(â¨)/g, "");
console.log("After: ", str);
  • Thank you for the explanation, but I need to know how to convert the line with javascript. – Wiplash Feb 16 '19 at 22:20
  • Do you want to remove all the wrong chars using JavaScript? You could use a simple Regex to delete the unwanted characters `str.replace(/(â©)|(â¨)/g, "")` – Alberti Buonarroti Feb 16 '19 at 22:28
  • This is incorrect: the correct encoding is UTF-8, which makes the control characters invisible (as they should be). Interpreting it as Windows 1252 turns them into the mojibake. – Draconis Feb 16 '19 at 22:36
  • Thanks for the insight, updated my answer. Also, thanks for "mojibake" didn't know there is a name for it. It makes it easier to research! – Alberti Buonarroti Feb 16 '19 at 22:39
  • Sorry does that mean ````str.replace(/(â©)|(â¨)/g, "")```` this line would work? – Wiplash Feb 16 '19 at 22:43
  • ````str.replace(/(â©)|(â¨)/g, "")```` doesn't actually work, because I can't guarantee the encoding type of the text coming in. the string is set by electrons clipboard.readText() so regex doesn't see these characters. – Wiplash Feb 19 '19 at 22:04
  • I just answered to your specific problem; removing these characters. You can easily create a regex that only allows for valid characters. – Alberti Buonarroti Feb 21 '19 at 08:55

I managed to solve this with the help of the following thread:

How to remove invalid UTF-8 characters from a JavaScript string?

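// Keep only ASCII characters (char codes 0-127) and drop everything else,
// including the invisible U+2068 / U+2069 characters.
function cleanString(input) {
    var output = "";
    for (var i = 0; i < input.length; i++) {
        // charCodeAt returns the UTF-16 code unit at position i
        if (input.charCodeAt(i) <= 127) {
            output += input.charAt(i);
        }
    }
    return output;
}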

It's something I looked at early on, but I must have been using it incorrectly.
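For reference, the same ASCII-only filter can probably be written as a single regex replace. This is a sketch rather than the code from the linked thread, and the name `cleanStringRegex` is made up here. Like cleanString, it also drops legitimate non-ASCII characters, such as accented letters in folder names, which may or may not be acceptable.

```javascript
// Sketch: regex equivalent of the loop-based cleanString, keeping only
// code units 0 through 127 (plain ASCII).
function cleanStringRegex(input) {
  return input.replace(/[^\x00-\x7F]/g, '');
}

// Usage: the escapes stand in for the invisible characters from the clipboard.
const fromClipboard = '/AE_Scripting/\u2068CEP_8.x\u2069/\u2068Documentation';
console.log(cleanStringRegex(fromClipboard));
```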

Wiplash