
I'm building an NLP classifier in Python and would like to build an HTML page to host a demo. I want to test on a sample text to see the prediction, and this is implemented in Python by tokenizing the text and then padding it before predicting. Like this:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# `tokenizer` is a Tokenizer instance already fitted on the training texts
token_list = tokenizer.texts_to_sequences([text])[0]
token_list_padded = pad_sequences([token_list], maxlen=max_length, padding=padding_type)

The problem is that I'm new to JavaScript, so are there tokenization and padding methods in JavaScript like in Python?

ezzeddin
  • You may want to look at https://ml5js.org/ — it's a JS library that is built on top of TensorFlow. – JDunken Jan 02 '20 at 14:02
  • I think ml5js is pretty new and does not support NLP functions like *tokenizer* and *pad_sequences* – ezzeddin Jan 03 '20 at 05:51

3 Answers


There is not yet a tokenizer in TensorFlow.js equivalent to the `tf.keras` Tokenizer in Python.

A simple JavaScript tokenizer has been described here. A more robust approach would be to use the tokenizer that comes with the Universal Sentence Encoder.
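A simple word-index tokenizer of the kind described above is short enough to write by hand. The sketch below mimics the behavior of the Keras Tokenizer (most frequent word gets index 1, index 0 reserved for padding, unknown words dropped); the function names `fitOnTexts` and `textsToSequences` are illustrative, not a library API, and this version splits only on whitespace rather than stripping punctuation as Keras does:

```javascript
// Build a Keras-style word index: most frequent word -> 1, next -> 2, ...
// Index 0 is implicitly reserved for padding, as in Keras.
function fitOnTexts(texts) {
  const counts = new Map();
  for (const text of texts) {
    for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  const wordIndex = {};
  [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .forEach(([word], i) => { wordIndex[word] = i + 1; });
  return wordIndex;
}

// Convert texts to sequences of indices; unknown words are dropped,
// matching the default Keras behavior when no oov_token is set.
function textsToSequences(wordIndex, texts) {
  return texts.map(text =>
    text.toLowerCase().split(/\s+/).filter(Boolean)
      .map(word => wordIndex[word])
      .filter(idx => idx !== undefined)
  );
}
```

For example, `fitOnTexts(['the cat sat on the mat'])` assigns `the` index 1 (it occurs twice), and `textsToSequences` then maps each text to an array of those indices.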

edkeveked

There is no native mechanism for tokenization in JavaScript.

You can use a JavaScript library such as natural, wink-tokenizer, or wink-nlp. The last of these automatically extracts a number of token features that may be useful in training.
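Whichever tokenizer you pick, the padding step also has no built-in JavaScript counterpart, but a `pad_sequences` equivalent is easy to write. The sketch below mirrors the Keras parameter names (`maxlen`, `padding`, `truncating`); the function name is illustrative, not part of any library:

```javascript
// Keras-style pad_sequences: pad with 0s (or truncate) to a fixed length.
// padding/truncating accept 'pre' or 'post', as in Keras; defaults are 'pre'.
function padSequences(sequences, maxlen, padding = 'pre', truncating = 'pre') {
  return sequences.map(seq => {
    if (seq.length > maxlen) {
      // Drop tokens from the front ('pre') or the back ('post').
      seq = truncating === 'pre' ? seq.slice(seq.length - maxlen) : seq.slice(0, maxlen);
    }
    const pad = new Array(maxlen - seq.length).fill(0);
    return padding === 'pre' ? [...pad, ...seq] : [...seq, ...pad];
  });
}
```

So `padSequences([[1, 2, 3]], 5)` yields `[[0, 0, 1, 2, 3]]`, and passing `'post'` as the third argument puts the zeros at the end instead, matching `padding_type` in the Python snippet above.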

sks

You can now use gpt-tokenizer; here is an example with npm. Note that it produces BPE token ids for OpenAI's GPT models rather than a Keras-style word index.

import {
  encode,
  encodeChat,
  decode,
  isWithinTokenLimit,
  encodeGenerator,
  decodeGenerator,
  decodeAsyncGenerator,
} from 'gpt-tokenizer'

const text = 'Hello, world!'
const tokenLimit = 10

// Encode text into tokens
const tokens = encode(text)

// Decode tokens back into text
const decodedText = decode(tokens)

// Check if text is within the token limit
// returns false if the limit is exceeded, otherwise returns the actual number of tokens (truthy value)
const withinTokenLimit = isWithinTokenLimit(text, tokenLimit)

// Example chat:
const chat = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'assistant', content: 'gpt-tokenizer is awesome.' },
]

// Encode chat into tokens
const chatTokens = encodeChat(chat)

// Check if chat is within the token limit
const chatWithinTokenLimit = isWithinTokenLimit(chat, tokenLimit)

// Encode text using generator
for (const tokenChunk of encodeGenerator(text)) {
  console.log(tokenChunk)
}

// Decode tokens using generator
for (const textChunk of decodeGenerator(tokens)) {
  console.log(textChunk)
}

// Decode tokens using async generator
// (assuming `asyncTokens` is an AsyncIterableIterator<number>)
for await (const textChunk of decodeAsyncGenerator(asyncTokens)) {
  console.log(textChunk)
}
Jonathan Coletti