I understand that if I have a categorical input with several possible values (e.g. country or color), I can use a one-hot tensor (represented as multiple 0s and a single 1).
I also understand that if the variable has many possible values (e.g. thousands of possible zip codes or school ids), a one-hot tensor might not be efficient and we should use other representations (hash-based?). But I have found neither documentation nor examples on how to do this with the JavaScript version of TensorFlow.
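To make the "hash-based" idea concrete, here is a minimal sketch in plain JavaScript, assuming the goal is only to squeeze a large set of category values into a fixed number of buckets. The function name `hashBucket` and the choice of an FNV-1a style hash are my own assumptions, not something taken from the TensorFlow.js API. Different values can collide in the same bucket, which is the usual trade-off of this approach.

```javascript
// Hypothetical helper: maps any category value (e.g. a zip code) to an
// integer bucket index in the range [0, numBuckets).
// Uses an FNV-1a style 32-bit hash over the UTF-16 code units of the value.
function hashBucket(value, numBuckets) {
  const text = String(value);
  let hash = 2166136261; // FNV 32-bit offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 16777619); // FNV 32-bit prime, 32-bit multiply
  }
  // ">>> 0" reinterprets the signed 32-bit result as unsigned before the modulo.
  return (hash >>> 0) % numBuckets;
}
```

The resulting bucket index could then be one-hot encoded with a depth of `numBuckets` instead of the full number of distinct values.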
Any hints?
UPDATE: @edkeveked gave me the right suggestion, which is to use embeddings, but now I need some help on how to actually use embeddings with TensorFlow.js.
Let me try with a concrete example:
Let's assume that I have records for people, each with an age (integer), a state (an integer from 0 to 49) and a risk (0 or 1).
const data = [
  {age: 20, state: 0, risk: 0},
  {age: 30, state: 35, risk: 0},
  {age: 60, state: 35, risk: 1},
  {age: 75, state: 17, risk: 1},
  ...
]
When I wanted to create a classifier model with TensorFlow.js, I encoded the state as a one-hot tensor, encoded the label (risk) as a one-hot tensor as well (risk: 01, no risk: 10), and built a model with dense layers such as the following:
const inputTensorAge = tf.tensor(data.map(d => d.age), [data.length, 1])
const inputTensorState = tf.oneHot(data.map(d => d.state), 50)
const labelTensor = tf.oneHot(data.map(d => d.risk), 2)
const inputDims = 51;
const model = tf.sequential({
  layers: [
    tf.layers.dense({units: 8, inputDim: inputDims, activation: 'relu'}),
    tf.layers.dense({units: 2, activation: 'softmax'}),
  ]
});
model.compile({loss: 'categoricalCrossentropy', optimizer: 'adam', metrics: ['accuracy']});
model.fit(tf.concat([inputTensorState, inputTensorAge], 1), labelTensor, {epochs: 10})
(BTW ... I am new to tensorflow, so there might be much better approaches ... but this has worked for me)
Now ... my challenge. Suppose I want a similar model, but with a postcode instead of the state (let's say that there are 10000 possible values for the postcode):
const data = [
  {age: 20, postcode: 0, risk: 0},
  {age: 30, postcode: 11, risk: 0},
  {age: 60, postcode: 11, risk: 1},
  {age: 75, postcode: 9876, risk: 1},
  ...
]
If I want to use embeddings to represent the postcode, I understand that I should use an embedding layer such as:
tf.layers.embedding({inputDim:10000, outputDim: 20})
So, if I used only the postcode as an input and omitted the age, the model would be:
const model = tf.sequential({
  layers: [
    tf.layers.embedding({inputDim: 10000, outputDim: 20}),
    tf.layers.dense({units: 2, activation: 'softmax'}),
  ]
});
If I create the input tensor as
const inputTensorPostcode = tf.tensor(data.map(d => d.postcode));
and try model.fit(inputTensorPostcode, labelTensor, {epochs: 10})
It will not work ... so I am obviously doing something wrong.
Any hints on how I should create my model and call model.fit with embeddings?
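One possible direction, offered as a minimal sketch rather than a confirmed fix: an embedding layer returns one vector per input position, so its output has shape [batch, inputLength, outputDim], and a flatten layer is commonly placed between it and a dense layer. The sketch also gives the input an explicit [n, 1] shape. The helper names `postcodeRows`, `buildPostcodeModel` and `trainPostcodeModel` are hypothetical, and `tf` is passed in as a parameter so the same code can be used with the browser global or with `require('@tensorflow/tfjs')`.

```javascript
// Hypothetical helper: one postcode per row, i.e. a nested array of shape [n, 1].
function postcodeRows(records) {
  return records.map(d => [d.postcode]);
}

// Hypothetical helper: postcode-only classifier.
// `tf` is the TensorFlow.js namespace, supplied by the caller.
function buildPostcodeModel(tf, numPostcodes = 10000, embeddingDim = 20) {
  const model = tf.sequential({
    layers: [
      // inputLength: 1 means each example carries a single postcode index.
      tf.layers.embedding({inputDim: numPostcodes, outputDim: embeddingDim, inputLength: 1}),
      // [batch, 1, embeddingDim] -> [batch, embeddingDim]
      tf.layers.flatten(),
      tf.layers.dense({units: 2, activation: 'softmax'}),
    ],
  });
  model.compile({loss: 'categoricalCrossentropy', optimizer: 'adam', metrics: ['accuracy']});
  return model;
}

// Hypothetical helper: trains on records shaped like {age, postcode, risk}.
async function trainPostcodeModel(tf, records) {
  const xs = tf.tensor2d(postcodeRows(records), [records.length, 1], 'int32');
  // One-hot labels, cast to float32 so they match the softmax output dtype.
  const risks = tf.tensor1d(records.map(d => d.risk), 'int32');
  const ys = tf.cast(tf.oneHot(risks, 2), 'float32');
  const model = buildPostcodeModel(tf);
  await model.fit(xs, ys, {epochs: 10});
  return model;
}
```

With the library loaded, this would be called as `await trainPostcodeModel(tf, data)`.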
Also ... if I want to combine multiple inputs (let's say postcode and age), how should I do it?
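For combining inputs, one commonly used pattern is the functional API (`tf.input` plus `tf.model`) with a concatenate layer, instead of `tf.sequential`. The following is a minimal sketch under that assumption: the postcode goes through an embedding and a flatten layer, and the result is concatenated with the age before the dense layers. The helper names `columnOf`, `buildPostcodeAgeModel` and `trainPostcodeAgeModel` are hypothetical, the layer sizes are copied from the examples above, and `tf` is passed in by the caller.

```javascript
// Hypothetical helper: extracts one field as a nested array of shape [n, 1].
function columnOf(records, key) {
  return records.map(d => [d[key]]);
}

// Hypothetical helper: two-input classifier (postcode index + age).
// `tf` is the TensorFlow.js namespace, supplied by the caller.
function buildPostcodeAgeModel(tf, numPostcodes = 10000, embeddingDim = 20) {
  const postcodeInput = tf.input({shape: [1], name: 'postcode'});
  const ageInput = tf.input({shape: [1], name: 'age'});

  // [batch, 1] -> [batch, 1, embeddingDim] -> [batch, embeddingDim]
  const embedded = tf.layers
    .embedding({inputDim: numPostcodes, outputDim: embeddingDim, inputLength: 1})
    .apply(postcodeInput);
  const flat = tf.layers.flatten().apply(embedded);

  // [batch, embeddingDim] joined with [batch, 1] -> [batch, embeddingDim + 1]
  const merged = tf.layers.concatenate().apply([flat, ageInput]);
  const hidden = tf.layers.dense({units: 8, activation: 'relu'}).apply(merged);
  const output = tf.layers.dense({units: 2, activation: 'softmax'}).apply(hidden);

  const model = tf.model({inputs: [postcodeInput, ageInput], outputs: output});
  model.compile({loss: 'categoricalCrossentropy', optimizer: 'adam', metrics: ['accuracy']});
  return model;
}

// Hypothetical helper: trains on records shaped like {age, postcode, risk}.
async function trainPostcodeAgeModel(tf, records) {
  const n = records.length;
  const postcodes = tf.tensor2d(columnOf(records, 'postcode'), [n, 1], 'int32');
  const ages = tf.tensor2d(columnOf(records, 'age'), [n, 1]);
  const risks = tf.tensor1d(records.map(d => d.risk), 'int32');
  const labels = tf.cast(tf.oneHot(risks, 2), 'float32');
  const model = buildPostcodeAgeModel(tf);
  // The input tensors are passed in the same order as the `inputs` array of the model.
  await model.fit([postcodes, ages], labels, {epochs: 10});
  return model;
}
```

With the library loaded, this would be called as `await trainPostcodeAgeModel(tf, data)`.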