8

I am working with the open-source library jsPDF, which does not support UTF-8 characters. Is there any way to remove all the UTF-8 quote characters from my HTML string, using a regex or any other method?

Pseudocode:

$(htmlstring).replace("utf-8 quotes character" , "") 
  • 2
    You seriously have a **JavaScript** library that **does not** support UTF-8? – adeneo Jul 30 '14 at 17:17
  • Yes, it's the jsPDF library; you can find the issue [here](https://github.com/MrRio/jsPDF/issues/12) –  Jul 30 '14 at 17:20
  • Would you please provide a solution that removes UTF-8 characters from my HTML string without affecting it much? –  Jul 30 '14 at 17:22
  • Is your problem the same as [this one](http://stackoverflow.com/questions/2145988/how-do-i-do-string-replace-in-javascript-to-convert-9-61-to-961)? – Miranda Jul 30 '14 at 17:22
  • Well that sucks, you would think UTF-8 support is a minimum requirement, good thing I never used jsPDF, that simply doesn't cut it for most websites. – adeneo Jul 30 '14 at 17:24
  • Would you provide me a regex that replaces UTF-8 quote characters only? –  Jul 30 '14 at 17:30

2 Answers

10

First off: I urge you to stop using jsPDF if it doesn't support Unicode. It's mid-2014, and the lack of support should have meant the death of the project two years ago. But that's just my personal conviction and not part of the answer you're looking for.

If jsPDF only supports ANSI (a 256-character block, rather than ASCII's 128-character block), then you can simply do a regex replace for everything above \xFF:

"lolテスト".replace(/[\u0100-\uFFFF]/g,'');
// gives us "lol"
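
Before stripping anything, it can help to see which characters in a string actually fall outside that range. A minimal sketch, assuming the 0x00-0xFF limit described above is what the library accepts; `findUnsupported` is a hypothetical helper name, not part of jsPDF:

```javascript
// Hypothetical helper: list the distinct characters above \xFF in a string.
// Assumes the library only handles code units 0x00-0xFF, as described above.
function findUnsupported(text) {
  const matches = text.match(/[\u0100-\uFFFF]/g) || [];
  // A Set drops duplicates while keeping first-seen order.
  return Array.from(new Set(matches));
}

console.log(findUnsupported("“lol” テスト, “again”"));
// → [ '“', '”', 'テ', 'ス', 'ト' ]
```

An empty array means the string is already safe under that assumption.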

If you only want to get rid of quotation marks (but leave in potentially jsPDF-breaking Unicode), you can use the pattern for "just quotation marks" based on where they live in the Unicode map:

string.replace(/[\u2018-\u201F\u275B-\u275E]/g, '')

will catch ['‘','’','‚','‛','“','”','„','‟','❛','❜','❝','❞'], although of course what you probably want to do is replace them with the corresponding safe character instead. Good news: just make a replacement array for the list just presented, and work with that.
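
The replacement idea could look roughly like this. It is only a sketch: the names `QUOTE_REPLACEMENTS` and `replaceQuotes` are made up for illustration, and the choice of ASCII stand-in for each quote (for example `'` for the low-9 single quote) is an assumption you may want to adjust:

```javascript
// Map each typographic quote matched by the pattern to a plain ASCII stand-in.
// Which stand-in to use for each character is an assumption, not a standard.
const QUOTE_REPLACEMENTS = {
  "\u2018": "'", "\u2019": "'", "\u201A": "'", "\u201B": "'",
  "\u201C": '"', "\u201D": '"', "\u201E": '"', "\u201F": '"',
  "\u275B": "'", "\u275C": "'", "\u275D": '"', "\u275E": '"',
};

function replaceQuotes(text) {
  // Every character the class can match has an entry in the map above.
  return text.replace(/[\u2018-\u201F\u275B-\u275E]/g, (ch) => QUOTE_REPLACEMENTS[ch]);
}

console.log(replaceQuotes("\u201CIt\u2019s fine\u201D"));
// → "It's fine" (including the straight double quotes)
```

Anything else above \xFF is left untouched, so it may still need the broader pattern.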

2017 edit:

ES6 introduced a new escape for Unicode code points in the form of the \u{...} pattern, which can do "any number of hexdigits" inside the curly braces. Inside a regexp these escapes only work together with the u flag, so a full Unicode 9 compatible regexp would now be:

// the u flag enables \u{...} in the pattern and makes it match whole code points
const searchPattern = /[\u{100}-\u{10FFFF}]/gu;
const c = `lolテスト`.replace(searchPattern, ``);
// gives us "lol"
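
Characters beyond U+FFFF are stored as two UTF-16 code units, so it matters whether a pattern works on code units or on code points. The sketch below compares the two using the ES6 `u` flag; the sample string and variable names are made up for illustration:

```javascript
// U+1F600 is stored as the surrogate pair \uD83D\uDE00.
const astralSample = "lol\u{1F600}テスト";

// Without the u flag the class matches single UTF-16 code units,
// so each half of the surrogate pair is replaced separately.
const byCodeUnit = astralSample.replace(/[\u0100-\uFFFF]/g, "?");

// With the u flag the class matches whole code points.
const byCodePoint = astralSample.replace(/[\u{100}-\u{10FFFF}]/gu, "?");

console.log(byCodeUnit);  // → lol?????
console.log(byCodePoint); // → lol????
```

When the replacement is the empty string both patterns give the same result; the difference only shows up when you substitute something per match.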
Mike 'Pomax' Kamermans
  • 1
    A little clarification. Once we have a JavaScript string we no longer have UTF-8 (or ISO-8859-1 or whatever encoding the file is saved as): JavaScript makes a transparent conversion to its internal encoding (UCS-2 or UTF-16, the engine can choose). Good news is that we don't need to think about encodings any more, we can refer to characters by their `\u` escape sequence, which is basically its universal Unicode code point. Bad news is that JavaScript will split characters beyond 0xFFFF due to incomplete Unicode support. – Álvaro González Jul 30 '14 at 18:00
  • Thanks for your advice, but I only needed this for a small task; I just want to know how to remove UTF quote characters only (, ’ and others) –  Jul 30 '14 at 18:02
  • @SomPathak you can, but any remaining non-ANSI Unicode is still going to break jsPDF. Simply find out the specific Unicode number for your quote symbols, and use a straight-up pattern like `/[\u2018-\u201F\u275B-\u275E]/g` – Mike 'Pomax' Kamermans Jul 30 '14 at 18:05
  • ... or do a simple `.replace(/[«»]/g, '')`. The problem with Unicode characters belongs to the library, not JavaScript itself. – Álvaro González Jul 30 '14 at 18:10
  • @ÁlvaroG.Vicario what would be the regex to replace all UTF-8 code? –  Jul 30 '14 at 18:17
  • fun fact, « and » are single byte, and not a problem in this case (\xAB and \xBB) – Mike 'Pomax' Kamermans Jul 30 '14 at 18:26
  • @Mike'Pomax'Kamermans Of course, it was just an example, we don't know exactly what "utf-8 quotes character" stands for. – Álvaro González Jul 30 '14 at 19:11
  • @SomPathak Replace all UTF-8 code? `htmlstring = "";`, because UTF-8 includes all characters that exist. Do you have a clear idea so far of what you want to remove? – Álvaro González Jul 30 '14 at 19:14
  • yes, yes, hilarious, but let's not go literal because someone uses "unicode" wrong =) His original post is pretty clear in that he wants higher unicode quotes replaced. – Mike 'Pomax' Kamermans Jul 31 '14 at 02:38
3

Use

htmlstring.replace(/[^\x00-\x7F]/g,'')

to remove all non-ASCII characters

(via regex-any-ascii-character)
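
The difference between this ASCII-only pattern and a cut-off at \xFF shows up with accented letters. A short sketch with a made-up sample string; whether jsPDF really handles the \x80-\xFF range is an assumption carried over from the other answer, not something verified here:

```javascript
// "café “menu” €5", written with escapes so the code points are unambiguous.
const accentSample = "caf\u00E9 \u201Cmenu\u201D \u20AC5";

// ASCII only: drops é, the curly quotes and the euro sign.
const asciiOnly = accentSample.replace(/[^\x00-\x7F]/g, "");

// Cut-off at \xFF: keeps é (U+00E9) but still drops the quotes and the euro sign.
const latin1Only = accentSample.replace(/[^\x00-\xFF]/g, "");

console.log(asciiOnly);  // → caf menu 5
console.log(latin1Only); // → café menu 5
```

If accented text matters in the output, the wider range loses less, provided the library copes with it.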

Community
Valerij