10

I'm new to machine learning and TensorFlow. Since I don't know Python, I decided to use the JavaScript version (which may be more of a wrapper).

The problem is that I'm trying to build a model that processes natural language. The first step is to tokenize the text so the data can be fed to the model. I did a lot of research, but most examples use the Python version of TensorFlow, with methods like tf.keras.preprocessing.text.Tokenizer, for which I can't find an equivalent in tensorflow.js. I'm stuck at this step and don't know how to convert text to vectors that can be fed to a model. Please help :)

edkeveked
Dacredible

4 Answers

9

There are many ways to transform text into vectors, depending on the use case. The most intuitive one uses term frequency: given the vocabulary of the corpus (all possible words), each text document is represented as a vector in which each entry is the number of times the corresponding word occurs in that document.

With this vocabulary :

["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"]

the following text:

["machine", "is", "a", "field", "machine", "is", "is"] 

will be transformed as this vector:

[2, 0, 3, 1, 0, 1, 0, 0, 0] 

One disadvantage of this technique is that the vector, which has the same size as the vocabulary of the corpus, may contain many zeros. That is why other techniques exist. Still, this bag-of-words representation is widely used, and there is a slightly different version of it that weights the counts using tf-idf.

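// Vocabulary of the corpus and a document that is already split into words
const vocabulary = ["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"];
const text = ["machine", "is", "a", "field", "machine", "is", "is"];
// For each vocabulary word, count how many times it occurs in the document
const parse = (t) => vocabulary.map((w) => t.reduce((a, b) => (b === w ? a + 1 : a), 0));
console.log(parse(text)); // [2, 0, 3, 1, 0, 1, 0, 0, 0]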
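
The tf-idf variant mentioned above could be sketched roughly as follows in plain JavaScript. This is only an illustration under assumptions: the toy corpus `docs` and the helper names `termFrequency` and `inverseDocFrequency` are made up for the example, and it uses the plain log(N / df) weighting, whereas libraries often apply smoothing.

```javascript
// Hypothetical toy corpus: each document is already split into words.
const docs = [
  ["machine", "learning", "is", "a", "field"],
  ["computer", "science", "is", "a", "field"],
  ["machine", "is", "new"],
];

// Vocabulary: every distinct word in the corpus, in first-seen order.
const vocab = [...new Set(docs.flat())];

// Term frequency: occurrences of `word` in `doc`, normalised by document length.
const termFrequency = (word, doc) =>
  doc.filter((w) => w === word).length / doc.length;

// Inverse document frequency: log(N / df), where df is the number of
// documents containing `word`. Every vocabulary word occurs at least once.
const inverseDocFrequency = (word) => {
  const df = docs.filter((doc) => doc.includes(word)).length;
  return Math.log(docs.length / df);
};

// One tf-idf vector per document, with one entry per vocabulary word.
const tfidf = docs.map((doc) =>
  vocab.map((word) => termFrequency(word, doc) * inverseDocFrequency(word))
);

console.log(vocab);
console.log(tfidf);
```

A word that appears in every document (here "is") gets a weight of zero, while a word that is specific to one document gets a positive weight. That is the main difference from the raw counts.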

There is also the following module that might help to achieve what you want

edkeveked
  • Thank you so much for answering. I think only converting a word to a number is not enough for tensorflow.js to use. The next step should be to convert each word to a multi-dimensional vector, and I have no idea how to approach that. Also, if I have to do this myself, does that mean tensorflow.js doesn't have a similar function for now? – Dacredible Aug 07 '18 at 21:30
  • 1
    TensorFlow.js does not have a string tokenizer yet. If the answer helped you, please upvote it and mark it as accepted. – edkeveked Aug 08 '18 at 04:26
  • 1
    I like the simplicity of the Keras tokenizer, so I recreated the parts of it that I need in JavaScript: https://gist.github.com/dlebech/5bbabaece36753f8a29e7921d8e5bfc7 Perhaps this will be useful for others that find this answer. – dlebech Apr 04 '19 at 20:04
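
For readers who want to see the idea discussed in these comments, here is a minimal sketch of a Keras-style tokenizer in plain JavaScript. It is not the code from the gist above and not part of the tensorflow.js API. The names `SimpleTokenizer`, `fitOnTexts`, `textsToSequences` and `padSequences` are hypothetical and only loosely mirror the Python method names. The resulting integer ids are the kind of input an embedding layer (for example tf.layers.embedding) can turn into multi-dimensional vectors.

```javascript
// Hypothetical stand-in for a Keras-style tokenizer (not a tensorflow.js API).
class SimpleTokenizer {
  constructor() {
    // word -> integer id, starting at 1 (0 is reserved for padding)
    this.wordIndex = new Map();
  }

  // Lowercase, replace anything that is not a letter, digit, apostrophe
  // or whitespace with a space, then split on whitespace.
  static splitWords(text) {
    return text
      .toLowerCase()
      .replace(/[^a-z0-9'\s]/g, " ")
      .split(/\s+/)
      .filter(Boolean);
  }

  // Build the vocabulary so that more frequent words get smaller ids.
  fitOnTexts(texts) {
    const counts = new Map();
    for (const text of texts) {
      for (const word of SimpleTokenizer.splitWords(text)) {
        counts.set(word, (counts.get(word) || 0) + 1);
      }
    }
    [...counts.entries()]
      .sort((a, b) => b[1] - a[1])
      .forEach(([word], i) => this.wordIndex.set(word, i + 1));
  }

  // Turn each text into a list of ids; words not seen during fitting are skipped.
  textsToSequences(texts) {
    return texts.map((text) =>
      SimpleTokenizer.splitWords(text)
        .map((word) => this.wordIndex.get(word))
        .filter((id) => id !== undefined)
    );
  }
}

// Left-pad with 0 (or drop leading ids) so every sequence has length maxLen.
// Assumes maxLen is a positive integer.
const padSequences = (seqs, maxLen) =>
  seqs.map((seq) => {
    const trimmed = seq.slice(-maxLen);
    return [...new Array(maxLen - trimmed.length).fill(0), ...trimmed];
  });

const tokenizer = new SimpleTokenizer();
tokenizer.fitOnTexts(["Machine learning is a field", "Machine learning is new"]);
const sequences = tokenizer.textsToSequences(["Learning is new", "a new machine"]);
const padded = padSequences(sequences, 4);
console.log(tokenizer.wordIndex, padded);
```

The padded arrays all have the same length, so they can be stacked into a single 2D tensor before being passed to a model.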
4

Well, I faced this issue and handled it by following the steps below:

  1. After tokenizer.fit_on_texts([data]), print tokenizer.word_index in your Python code.
  2. Copy the word_index output and save it as a JSON file.
  3. Refer to this JSON object (loaded here as word2index) to generate tokenized words, like this:

       function getTokenisedWord(seedWord) {
         const _token = word2index[seedWord.toLowerCase()];
         return tf.tensor1d([_token]);
       }

  4. Feed the token to the model:

       const seedWordToken = getTokenisedWord('Hello');
       model.predict(seedWordToken).data().then(predictions => {
         const resultIdx = tf.argMax(predictions).dataSync()[0];
         console.log('Predicted Word ::', index2word[resultIdx]);
       });

  5. index2word is the reverse mapping of the word2index JSON object.
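
The two mappings used in the steps above could be built and used roughly like this, without any TensorFlow dependency. This is a sketch under assumptions: the `word2index` contents are a made-up excerpt rather than real exported data, and `OOV_ID`, `encode` and `decode` are hypothetical helpers. Whether id 0 is appropriate for unknown words depends on how the Python tokenizer was configured.

```javascript
// Hypothetical excerpt of the word_index JSON exported from Python.
const word2index = { hello: 1, world: 2, machine: 3, learning: 4 };

// Reverse mapping: id -> word (object keys become strings, lookups still work).
const index2word = Object.fromEntries(
  Object.entries(word2index).map(([word, id]) => [id, word])
);

// Assumed id for out-of-vocabulary words; match whatever your Python side used.
const OOV_ID = 0;

// Encode a whole sentence into ids, falling back to OOV_ID for unknown words.
const encode = (sentence) =>
  sentence
    .toLowerCase()
    .split(/\s+/)
    .filter(Boolean)
    .map((word) =>
      Object.prototype.hasOwnProperty.call(word2index, word)
        ? word2index[word]
        : OOV_ID
    );

// Decode ids back into words; ids without an entry become "<unk>".
const decode = (ids) => ids.map((id) => index2word[id] ?? "<unk>");

const ids = encode("Hello machine learning friends");
console.log(ids, decode(ids));
```

Handling unknown words explicitly avoids passing undefined into a tensor when the seed word is missing from the exported vocabulary.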
Deepak P
1

I recently created an npm module for this, in case it helps anyone.

https://github.com/rjmacarthy/string-tokeniser

rjmacarthy
0

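// Vocabulary of the corpus and a document that is already split into words
const vocabulary = ["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"];
const text = ["machine", "is", "a", "field", "machine", "is", "is"];
// For each vocabulary word, count how many times it occurs in the document
const parse = (t) => vocabulary.map((w) => t.reduce((a, b) => (b === w ? a + 1 : a), 0));
console.log(parse(text)); // [2, 0, 3, 1, 0, 1, 0, 0, 0]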
Kabeer Jaffri