
I am trying to parse tweets in order to add links to hashtags, mentions, etc. But when a tweet contains a special character like "", this character is counted as 2, for example:

console.log("it was fantastic. Thanks a lot guys ".length);

This line returns 38, not 37. So when I use the "indices" from the entities in the Twitter API, everything after that character is shifted.

Is there a way to avoid this?

Thanks in advance!

  • The encoding is probably coming in as UTF-8, and the console font can't display that character. Try displaying it on screen in a font that supports those special chars. – brandonscript Dec 24 '13 at 01:01
  • @r3mus: the question is about character indexes & the string length, not its display representation – zerkms Dec 24 '13 at 01:03
  • The string length is related to the characters in the string. That will be a multibyte character which, while it appears as one, takes up the length of two. Unless I'm much mistaken by the phrase "A multibyte character is a character whose bit representation fits into more than one byte." – Popnoodles Dec 24 '13 at 01:04
  • @zerkms WOW.. I must be crazy today. 8~\ – brandonscript Dec 24 '13 at 01:04
  • @popnoodles: why do you think so? `console.log('привет'.length);` – zerkms Dec 24 '13 at 01:05
  • @r3mus is right! For example, with the Vietnamese character aa = â, you see only one character, but it takes up the length of two. – Ringo Dec 24 '13 at 01:05
  • On a non-crazy note, have you tested that the twitter char count doesn't treat special utf-8 chars as 2 as well? – brandonscript Dec 24 '13 at 01:06
  • Btw, `` - is 4 bytes long. Seems like `.length` has some issues with characters that long. The 3-byte characters are measured correctly: `'䀹'.length == 1` – zerkms Dec 24 '13 at 01:07
  • @r3mus "" is 1 char, not 2 for twitter's compose box –  Dec 24 '13 at 01:09
  • @Florent yeah, that's what I was wondering. Wasn't sure if they handled it correctly. Might take a look at the javascript source on their compose box to see how they're handling it? – brandonscript Dec 24 '13 at 01:11
  • Heh, or see: http://stackoverflow.com/questions/2848462/count-bytes-in-textarea-using-javascript -- particularly @Tgr's answer – brandonscript Dec 24 '13 at 01:12
  • @r3mus: how is it relevant? – zerkms Dec 24 '13 at 01:14
  • `function (a,c){c||(c={short_url_length:22,short_url_length_https:23});var d=b.txt.getUnicodeTextLength(a),e=b.txt.extractUrlsWithIndices(a);b.txt.modifyIndicesFromUTF16ToUnicode(a,e);for(var f=0;f – zerkms Dec 24 '13 at 01:17
  • @Man of Snow: what does "is hex" mean? – zerkms Dec 24 '13 at 01:18

1 Answer


You need to handle surrogate pairs: characters outside the Basic Multilingual Plane, which JavaScript strings store as two UTF-16 code units, so each one adds 2 to `.length`. Check out: https://github.com/eller86/surrogate-pair.js. It looks a bit outdated, but gets the job done:

var sp = require('surrogate-pair')

console.log(sp.countCodePoints('it was fantastic. Thanks a lot guys ')) // 37

But remember, this introduces overhead, because the whole string has to be scanned to count code points. Honestly, I don't think you need to know the code point length for most kinds of parsing. As long as there is no chance of splitting a surrogate pair, you are fine.
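If you would rather not pull in a library, here is a minimal sketch of the same idea. It assumes an ES2015+ runtime (string iteration walks code points, not code units) and assumes the "indices" in the entities count code points, which is what the off-by-one in the question suggests. The function names and the sample character U+1F600 are made up for illustration; they are not from the question or from surrogate-pair.js.

```javascript
// Count Unicode code points instead of UTF-16 code units.
// Iterating a string with for...of yields whole code points,
// so a surrogate pair is counted once.
function countCodePoints(str) {
  let count = 0;
  for (const _ of str) count++;
  return count;
}

// Convert a code point index into the UTF-16 index that slice() expects.
// Indices past the end of the string map to str.length.
function codePointIndexToUtf16(str, cpIndex) {
  let utf16 = 0;
  let seen = 0;
  for (const ch of str) {
    if (seen === cpIndex) break;
    utf16 += ch.length; // 1 for a BMP character, 2 for a surrogate pair
    seen++;
  }
  return utf16;
}

// Hypothetical helper: the text covered by an entity's [start, end) indices,
// where both indices are expressed in code points.
function sliceByCodePoints(str, start, end) {
  return str.slice(
    codePointIndexToUtf16(str, start),
    codePointIndexToUtf16(str, end)
  );
}

// U+1F600 is only an example of a character outside the BMP.
const tweet = 'guys \u{1F600} #thanks';
console.log(tweet.length);                    // 15
console.log(countCodePoints(tweet));          // 14
console.log(sliceByCodePoints(tweet, 7, 14)); // #thanks
```

Converting each pair of indices once, before inserting links, keeps the rest of the parsing on ordinary string offsets and never cuts a surrogate pair in half.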

vkurchatkin
  • Thanks. But why does Twitter return text using surrogate characters, instead of standard unicode, utf-8 encoded? – nealmcb Jan 30 '22 at 17:27