
I want to truncate a piece of UTF-8 encoded text to a given length in bytes. For example, if the text is

Hello , I like rice cakes ¯\_(ツ)_/¯

I would like to truncate that text to 10 bytes max.

I found the truncate-utf8-bytes NPM module, which does exactly what I need. Unfortunately, the project I am working on doesn't use webpack or browserify, so as far as I'm aware I cannot use NPM modules.

So I was wondering if there was a reliable way to truncate the text, or if there was a way for me to use the truncate-utf8-bytes module in the browser.

Thanks

  • Have you checked https://stackoverflow.com/questions/1515884/using-javascript-to-truncate-text-to-a-certain-size-8-kb? – Shinjo Sep 03 '19 at 10:09
  • @Shinjo Yes I have, but I read that the solutions there are deprecated. Also, I need a solution that takes multi-byte characters and surrogate pairs into account. – Uchenna Okafor Sep 03 '19 at 10:10
  • Really? It works fine, FWIW: https://jsfiddle.net/0xmcauqw/ Regarding "Also I need solutions that take into account of multi-byte characters and surrogate pairs": what is your input and expected output? Also, what is your current progress? Please add a [mcve]. – Shinjo Sep 03 '19 at 10:14
  • Did you not read the package's documentation? "*[A browser implementation](https://github.com/parshap/truncate-utf8-bytes/blob/master/browser.js) that doesn't use Buffer.byteLength is provided*" (using [this](https://github.com/parshap/utf8-byte-length/blob/master/browser.js) and [that](https://github.com/parshap/truncate-utf8-bytes/blob/master/lib/truncate.js)). If your project doesn't use a bundler, that means you have to bundle manually, but the code is still there. – Bergi Sep 03 '19 at 10:15
  • How about something like [`this`](https://jsbin.com/dabumix/edit?js,console)? – Code Maniac Sep 03 '19 at 10:17
  • @Bergi I did. My interpretation was that it doesn't use the Buffer library, which is a Node.js module, and that this is why it says "browser": browsers don't have those modules. – Uchenna Okafor Sep 03 '19 at 10:19
  • Anyways, I have found an answer. Thanks @CodeManiac and Shinjo – Uchenna Okafor Sep 03 '19 at 10:23

2 Answers


Something like this should work, assuming you know the encoding of the text:

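// Sample text; the backslash must be escaped inside a JavaScript string literal.
const str = 'Hello , I like rice cakes ¯\\_(ツ)_/¯';
const enc = new TextEncoder();           // always encodes to UTF-8
const dec = new TextDecoder('utf-8');
const uint8 = enc.encode(str);           // Uint8Array, one element per byte
const section = uint8.slice(0, 11);      // keep the first 11 bytes
// A character cut in half decodes to U+FFFD (the replacement character).
const result = dec.decode(section);
console.log('result', result);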
Gavin
  • I think this is exactly what I was looking for. I am already using TextDecoder to check the length in bytes, I just didn't know there was a TextEncoder. Thank you so much. Quick question, the uint8.slice(0, 11), does each array item represent one byte? – Uchenna Okafor Sep 03 '19 at 10:32
  • Yup, in that example uint8 is an array of 8-bit bytes (which is what you'd want for UTF-8). Note that slice will cut through any character that uses multiple bytes. As I understand it, TextDecoder still makes the result safe for rendering by substituting U+FFFD for the broken bytes, but you might want to check that for your purposes. – Gavin Sep 03 '19 at 10:37
  • Also note: TextEncoder only supports utf8 now (https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder) but there is a polyfill for other encodings. – Gavin Sep 03 '19 at 10:39
  • This solution will not correctly truncate multi-byte characters - for example if you do `slice(0,8)` instead, you will get a corrupted character: `'Hello �'` – Mikael Finstad Jul 30 '23 at 12:11
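
One way to avoid the corrupted character described in the last comment is to move the cut back to a character boundary before decoding. The following is only a minimal sketch: `truncateUtf8` is a hypothetical helper name, not part of the answer above or of the truncate-utf8-bytes module. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```javascript
// Hypothetical helper (not from the answer above): truncate `str` so that its
// UTF-8 encoding is at most `maxBytes` bytes, without splitting a character.
function truncateUtf8(str, maxBytes) {
  const bytes = new TextEncoder().encode(str);
  if (bytes.length <= maxBytes) return str; // already short enough

  let end = Math.max(0, maxBytes);
  // bytes[end] is the first byte that would be dropped. If it is a
  // continuation byte (10xxxxxx), the cut falls inside a character, so
  // back up until the cut sits in front of a lead byte.
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) end--;

  return new TextDecoder('utf-8').decode(bytes.subarray(0, end));
}
```

For example, `truncateUtf8('aツb', 3)` returns `'a'`, because ツ occupies three bytes and does not fit in the remaining two. Note that this keeps code points intact but can still separate a base character from a following combining mark or split a multi-code-point emoji sequence.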

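Answer 1 works great, but you might consider adding this at the end to strip the replacement character (U+FFFD) left behind when the cut falls in the middle of a character. Strings are immutable, so the return value has to be kept:

const cleaned = result.replace(/\uFFFD/g, '');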
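
One caveat: the replace above removes every U+FFFD, including any that were already present in the input. A possible alternative that never produces a partial character in the first place is `TextEncoder.prototype.encodeInto`, which only writes whole characters into the destination buffer. This is a sketch under assumptions: `truncateWithEncodeInto` is a hypothetical name, `maxBytes` is assumed to be a non-negative integer, and the environment is assumed to support `encodeInto` (older browsers do not).

```javascript
// Hypothetical helper: truncate `str` to at most `maxBytes` UTF-8 bytes
// by letting the encoder decide how many characters fit.
function truncateWithEncodeInto(str, maxBytes) {
  const buffer = new Uint8Array(maxBytes);
  // encodeInto stops before any character that would not fit completely
  // and reports how many UTF-16 code units of `str` it consumed.
  const { read } = new TextEncoder().encodeInto(str, buffer);
  return str.slice(0, read);
}
```

Because `read` counts UTF-16 code units, `str.slice(0, read)` lines up with the encoded prefix, and a surrogate pair is either kept whole or dropped whole.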
drnugent
goebel02