
I am creating a Table of Contents for Google Slides decks. I have solved many problems, but one slide I found (someone else's) produces a seemingly blank text entry in the Table of Contents. If I copy the entry into Notepad, it looks like a square with a question mark in it. I have read in multiple places that this indicates an unprintable character. I would like to include all printable characters in the Table of Contents, no matter what language they are in. I also want to preserve things like trademark and copyright symbols. I expect some people will include emojis in their slides, but I have not tested that yet. If they pass through and are visible in the Table of Contents, that will be fine.

These are the things I have tried in order to remove the unprintable character(s). My mystery character is still getting through.

let beforeTxt = txtBack;
txtBack = beforeTxt.replace(/[^0-9a-z\u0600-\u06FF]/gi, " "); // reserves Arabic characters https://stackoverflow.com/questions/9364400/remove-not-alphanumeric-characters-from-string
if (beforeTxt != txtBack)
  console.log("1 + + + + + + + hidden char in text: ; ", beforeTxt);

beforeTxt = txtBack;
txtBack = beforeTxt.replace("/[^0-9a-z\u0600-\u06FF]/gi", " "); // reserves Arabic characters https://stackoverflow.com/questions/9364400/remove-not-alphanumeric-characters-from-string
if (beforeTxt != txtBack)
  console.log("2 + + + + + + + hidden char in text: ; ", beforeTxt);

beforeTxt = txtBack;
txtBack = beforeTxt.replace("[^\x00-\x7F]/", " "); // replace unprintable char with space
if (beforeTxt != txtBack)
  console.log("3 + + + + + + + hidden char in text: ; ", beforeTxt);

beforeTxt = txtBack;
txtBack = beforeTxt.replace("[^\x00-\x7F]/", "gi", " "); // replace unprintable char with space
if (beforeTxt != txtBack)
  console.log("4 + + + + + + + hidden char in text: ; ", beforeTxt);

beforeTxt = txtBack;
// this invisible character looks like a question mark in a box if copied into notepad
txtBack = beforeTxt.replace("", " "); // replace unprintable char with space
if (beforeTxt != txtBack)
  console.log("5 + + + + + + + hidden char in text: ; ", beforeTxt);

Am I doing this incorrectly? There is no limit to the number of silly things people might include on slides. The thing I want is for the text in the Table of Contents to be visible.

aNewb

1 Answer


You can use String.charCodeAt() to try and identify the character.

So if you know the location of the character you can:

// assumes you already have a variable "stringWithUnknownChar"

let unknownChar = stringWithUnknownChar[5]; // if the char is at index 5

let unknownCharCode = stringWithUnknownChar.charCodeAt(5); // its UTF-16 code unit, as a number

This assumes that you have no way of finding out what the original character was. Do you?
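If the position of the character is not known, one option is to scan the whole string and report every non-ASCII character together with its index and code point. The following is only a sketch; `listNonAscii` is a hypothetical helper name, not something from the Slides API, and the ASCII cutoff is an assumption made to keep the output short.

```javascript
// Hypothetical helper: list every non-ASCII character in a string
// with its position and code point, so unknown characters can be identified.
function listNonAscii(text) {
  const found = [];
  let index = 0;
  for (const ch of text) {               // for...of walks the string by code point
    const codePoint = ch.codePointAt(0);
    if (codePoint > 0x7f) {              // assumption: only non-ASCII is of interest
      const hex = "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
      found.push({ index, codePoint, hex });
    }
    index += ch.length;                  // 2 for characters stored as surrogate pairs
  }
  return found;
}

console.log(listNonAscii("Intro\uE907"));
```

Logging the result for each suspicious Table of Contents entry should show which code points are present, which can then be looked up in a Unicode chart.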

If the source already contained a character from the Specials Unicode block (U+FFF0 to U+FFFF), then it's likely that the original character encoding was lost: the "unknown character" is rendered as U+FFFx irrespective of what it was, so when you copy it, you are just copying the code for "unknown character".

If this is the case then unfortunately there would be no way to render this character because there is no reference to what it was.
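To illustrate how the original character gets lost, here is a small sketch that decodes a byte sequence containing an invalid UTF-8 byte. It assumes an environment where `TextDecoder` is available (a browser or Node.js); it is not meant to run inside Apps Script, and the byte values are made up for the demonstration.

```javascript
// Bytes for "A", one invalid UTF-8 byte (0xFF), and "B".
const bytes = new Uint8Array([0x41, 0xff, 0x42]);

// The decoder cannot interpret 0xFF, so it substitutes U+FFFD (65533).
const decoded = new TextDecoder("utf-8").decode(bytes);

// Collect the code point of each decoded character.
const codes = Array.from(decoded, (ch) => ch.codePointAt(0));
console.log(codes); // [65, 65533, 66]
```

Once the substitution has happened, the string only contains 65533, and nothing in it records which byte or character was there before.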

EDIT:

Based on your comment, you might go through the different replacement characters and find out what char code corresponds with them in JavaScript so you can filter them out in the way you have started to approach it above. For instance

console.log("�".charCodeAt(0)) // gives 65533
console.log("\uDBFF".charCodeAt(0)) // gives 56319 (0xDBFF, a high surrogate code unit)

So then you could do something like this:

let txtBack = beforeTxt.replace(/\uFFFD/g, " "); // U+FFFD is 65533; the g flag replaces every occurrence, not just the first

If those codes aren't right you can try with other replacement characters from the linked Wikipedia article.
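Instead of listing codes one at a time, a broader filter might look like the sketch below. Note that `replace()` treats a string first argument as literal text and only replaces the first match, so a regular expression literal with the `g` flag is used here. The sketch assumes a runtime that supports Unicode property escapes (`\p{...}` with the `u` flag), which I believe the Apps Script V8 runtime does, but that is worth verifying. `cleanTocText` and `isBlankEntry` are hypothetical names, and the choice of categories to strip is my assumption about what counts as "unprintable" here.

```javascript
// Characters replaced with a space (an assumed definition of "unprintable"):
//   \p{Cc}  control characters
//   \p{Co}  private-use characters (includes 59655, i.e. U+E907); these
//           usually show as a box unless a matching font is installed
//   \p{Cs}  lone surrogates (includes 56319, i.e. 0xDBFF)
//   \uFFFD  the replacement character (65533)
const UNPRINTABLE = /[\p{Cc}\p{Co}\p{Cs}\uFFFD]/gu;

// Hypothetical helper: swap every unprintable character for a space.
function cleanTocText(text) {
  return text.replace(UNPRINTABLE, " ");
}

// Hypothetical helper: true when nothing visible is left after cleaning,
// so the entry can be skipped when building the Table of Contents.
function isBlankEntry(text) {
  return cleanTocText(text).trim() === "";
}
```

Letters in any script, symbols such as © and ™, and emoji fall outside these categories, so they should pass through unchanged. Tabs and newlines are control characters and would also become spaces, which seems acceptable for a single-line Table of Contents entry.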

Source

iansedano
  • The slide deck I was practicing on has over 300 slides. I am grabbing text in various ways depending on what someone put on the slide. I do not know until I look at the TOC that something I picked up is unprintable. It does not matter what it was. If I can turn it into a space I can skip a space filled entry when building the TOC. – aNewb Mar 25 '21 at 16:31
  • Thanks for the clarification @aNewb , updated my answer with an approach you could take, does that work? – iansedano Mar 25 '21 at 17:09
  • That is strange! Interesting...so what is the actual character that is `NaN` char code? What does it look like in the presentation? – iansedano Mar 26 '21 at 15:40
  • The offending code was 59655 but I added logic to remove all three. Thanks very much. – aNewb Mar 26 '21 at 15:45