
I'm building an NLP classifier in Python and would like to build an HTML page to host a demo. I want to test on a sample text to see the prediction, and this is implemented in Python by tokenizing the text and then padding it before predicting. Like this:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# `tokenizer` is a Tokenizer instance already fitted on the training texts
token_list = tokenizer.texts_to_sequences([text])[0]
token_list_padded = pad_sequences([token_list], maxlen=max_length, padding=padding_type)

The problem is that I'm new to JavaScript, so are there tokenization and padding methods in JavaScript like in Python?

ezzeddin
  • You may want to look at https://ml5js.org/ — it's a JS library that is built on top of TensorFlow. – JDunken Jan 02 '20 at 14:02
  • I think ml5js is pretty new and does not support NLP functions like *tokenizer* and *pad_sequences* – ezzeddin Jan 03 '20 at 05:51

3 Answers


There is not yet a tokenizer in TensorFlow.js equivalent to the `tf.keras` Tokenizer in Python.

A simple JavaScript tokenizer has been described here. A more robust approach would be to use the tokenizer that comes with the Universal Sentence Encoder.
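A simple word-index tokenizer of the kind described above is short enough to write by hand. The sketch below mimics the behavior of the Keras Tokenizer (most frequent word gets index 1, index 0 reserved for padding, unknown words dropped); the function names `fitOnTexts` and `textsToSequences` are illustrative, not a library API, and this version splits only on whitespace rather than stripping punctuation as Keras does:

```javascript
// Build a Keras-style word index: most frequent word -> 1, next -> 2, ...
// Index 0 is implicitly reserved for padding, as in Keras.
function fitOnTexts(texts) {
  const counts = new Map();
  for (const text of texts) {
    for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  const wordIndex = {};
  [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .forEach(([word], i) => { wordIndex[word] = i + 1; });
  return wordIndex;
}

// Convert texts to sequences of indices; unknown words are dropped,
// matching the default Keras behavior when no oov_token is set.
function textsToSequences(wordIndex, texts) {
  return texts.map(text =>
    text.toLowerCase().split(/\s+/).filter(Boolean)
      .map(word => wordIndex[word])
      .filter(idx => idx !== undefined)
  );
}
```

For example, `fitOnTexts(['the cat sat on the mat'])` assigns `the` index 1 (it occurs twice), and `textsToSequences` then maps each text to an array of those indices.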

edkeveked

There is no native mechanism for tokenization in JavaScript.

You can use a JavaScript library such as natural, wink-tokenizer, or wink-nlp. The last of these automatically extracts a number of token features that may be useful in training.
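Whichever tokenizer you pick, the padding step also has no built-in JavaScript counterpart, but a `pad_sequences` equivalent is easy to write. The sketch below mirrors the Keras parameter names (`maxlen`, `padding`, `truncating`); the function name is illustrative, not part of any library:

```javascript
// Keras-style pad_sequences: pad with 0s (or truncate) to a fixed length.
// padding/truncating accept 'pre' or 'post', as in Keras; defaults are 'pre'.
function padSequences(sequences, maxlen, padding = 'pre', truncating = 'pre') {
  return sequences.map(seq => {
    if (seq.length > maxlen) {
      // Drop tokens from the front ('pre') or the back ('post').
      seq = truncating === 'pre' ? seq.slice(seq.length - maxlen) : seq.slice(0, maxlen);
    }
    const pad = new Array(maxlen - seq.length).fill(0);
    return padding === 'pre' ? [...pad, ...seq] : [...seq, ...pad];
  });
}
```

So `padSequences([[1, 2, 3]], 5)` yields `[[0, 0, 1, 2, 3]]`, and passing `'post'` as the third argument puts the zeros at the end instead, matching `padding_type` in the Python snippet above.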

sks

You can now use gpt-tokenizer; here is an example with npm. Note that it produces BPE token ids for OpenAI's GPT models rather than a Keras-style word index.

import {
  encode,
  encodeChat,
  decode,
  isWithinTokenLimit,
  encodeGenerator,
  decodeGenerator,
  decodeAsyncGenerator,
} from 'gpt-tokenizer'

const text = 'Hello, world!'
const tokenLimit = 10

// Encode text into tokens
const tokens = encode(text)

// Decode tokens back into text
const decodedText = decode(tokens)

// Check if text is within the token limit
// returns false if the limit is exceeded, otherwise returns the actual number of tokens (truthy value)
const withinTokenLimit = isWithinTokenLimit(text, tokenLimit)

// Example chat:
const chat = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'assistant', content: 'gpt-tokenizer is awesome.' },
]

// Encode chat into tokens
const chatTokens = encodeChat(chat)

// Check if chat is within the token limit
const chatWithinTokenLimit = isWithinTokenLimit(chat, tokenLimit)

// Encode text using generator
for (const tokenChunk of encodeGenerator(text)) {
  console.log(tokenChunk)
}

// Decode tokens using generator
for (const textChunk of decodeGenerator(tokens)) {
  console.log(textChunk)
}

// Decode tokens using async generator
// (assuming `asyncTokens` is an AsyncIterableIterator<number>)
for await (const textChunk of decodeAsyncGenerator(asyncTokens)) {
  console.log(textChunk)
}
Jonathan Coletti