Parse tweet with special characters

Question

I am trying to parse tweets in order to add links on hashtags, mentions, etc. But when a tweet contains a special character like "", that character is counted as 2, for example:

console.log("it was fantastic. Thanks a lot guys ".length);

This line returns 38, not 37. So when I use the "indices" from the entities in the Twitter API, everything is shifted.

Is there any way to avoid this?

Thanks in advance!

The encoding is probably coming in as UTF-8, and the console font can't display that character. Try displaying it on screen in a font that supports those special chars. — brandonscript, Dec 24 '13 at 01:01
@r3mus: the question is about character indexes & the string length, not its display representation — zerkms, Dec 24 '13 at 01:03
The string length is related to the characters in the string. That will be a multibyte character, which, while it appears as one, takes up the length of two. Unless I'm much mistaken about the phrase "A multibyte character is a character whose bit representation fits into more than one byte." — Popnoodles, Dec 24 '13 at 01:04
@popnoodles: why do you think so? `console.log('привет'.length);` — zerkms, Dec 24 '13 at 01:05
@r3mus is right! For example, with the Vietnamese character aa = â. You only see one character, but it takes up the length of two. — Ringo, Dec 24 '13 at 01:05
On a non-crazy note, have you tested that the twitter char count doesn't treat special utf-8 chars as 2 as well? — brandonscript, Dec 24 '13 at 01:06
Btw, `` - is 4 bytes long. Seems like `.length` has some issues with characters that long. The 3-byte characters are measured correctly: `'䀹'.length == 1` — zerkms, Dec 24 '13 at 01:07
@Florent yeah, that's what I was wondering. Wasn't sure if they handled it correctly. Might take a look at the javascript source on their compose box to see how they're handling it? — brandonscript, Dec 24 '13 at 01:11
Heh, or see: http://stackoverflow.com/questions/2848462/count-bytes-in-textarea-using-javascript -- particularly @Tgr's answer — brandonscript, Dec 24 '13 at 01:12
`function (a,c){c||(c={short_url_length:22,short_url_length_https:23});var d=b.txt.getUnicodeTextLength(a),e=b.txt.extractUrlsWithIndices(a);b.txt.modifyIndicesFromUTF16ToUnicode(a,e);for(var f=0;f — zerkms, Dec 24 '13 at 01:17

score 2 · Answer 1 · answered Dec 24 '13 at 04:09

You need to handle surrogate pairs - single characters that JavaScript stores as two UTF-16 code units, each of which `.length` counts separately. Check out: https://github.com/eller86/surrogate-pair.js. It looks a bit outdated, but it gets the job done:

var sp = require('surrogate-pair')

console.log(sp.countCodePoints('it was fantastic. Thanks a lot guys ')) // 37

But remember, this introduces significant overhead. Honestly, I don't think you need to know the code point length to do this kind of parsing. As long as there is no chance of splitting a surrogate pair, you are fine.
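If you would rather not pull in a library, the built-in string iterator (ES2015 and later) already walks a string by code point, so the same count can be done by hand. The sketch below is only an illustration under a few assumptions: that the entity "indices" count code points (which is what the question's off-by-one suggests), and that the tweet text, the entity object, and the helper name `codePointIndexToUtf16` are all made up for the example rather than taken from the Twitter API.

```javascript
// Count Unicode code points instead of UTF-16 code units.
// The string iterator yields one item per code point, so a
// surrogate pair is counted once.
function countCodePoints(str) {
  let count = 0;
  for (const _ of str) count++;
  return count;
}

// Hypothetical helper: convert a code point index (assumed to be what
// the entity "indices" contain) into the UTF-16 offset that
// String.prototype.slice expects.
function codePointIndexToUtf16(str, cpIndex) {
  let utf16 = 0;
  let seen = 0;
  for (const ch of str) {
    if (seen === cpIndex) break;
    utf16 += ch.length; // 1 for a BMP character, 2 for a surrogate pair
    seen++;
  }
  return utf16;
}

// Made-up tweet: U+1F600 sits before the hashtag.
const tweet = "Great \u{1F600} #party";
// Made-up entity with code point indices [start, end).
const hashtag = { text: "party", indices: [8, 14] };

const start = codePointIndexToUtf16(tweet, hashtag.indices[0]);
const end = codePointIndexToUtf16(tweet, hashtag.indices[1]);

console.log(tweet.length);            // 15 (UTF-16 code units)
console.log(countCodePoints(tweet));  // 14 (code points)
console.log(tweet.slice(start, end)); // "#party"
```

Each conversion scans the string from the start, so for a tweet with many entities it is probably cheaper to walk the text once and translate all indices in that single pass.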

Thanks. But why does Twitter return text using surrogate characters, instead of standard unicode, utf-8 encoded? — nealmcb, Jan 30 '22 at 17:27
