
I understand that if I have a categorical input with several possible values (e.g. country or color), I can use a one-hot tensor (represented as multiple 0s and only one 1).

I also understand that if the variable has many possible values (e.g. thousands of possible zip codes or school ids) a one-hot tensor might not be efficient, and we should use other representations (hash based?). But I have not found documentation or examples on how to do this with the JavaScript version of TensorFlow.

Any hints?

UPDATE: @edkeveked gave me the right suggestion on using embeddings, but now I need some help on how to actually use embeddings with TensorFlow.js.

Let me try with a concrete example:

Let's assume that I have records for people, for which I have age (integer), state (an integer from 0 to 49) and risk (0 or 1).

const data = [
  {age: 20, state: 0, risk: 0},
  {age: 30, state: 35, risk: 0},
  {age: 60, state: 35, risk: 1},
  {age: 75, state: 17, risk: 1},
  ...
]

When I have wanted to create a classifier model with TensorFlow.js, I would encode the state as a one-hot tensor, encode the risk label as a one-hot tensor (risk: [0, 1], no risk: [1, 0]), and build a model with dense layers such as the following:

// age as a [numExamples, 1] column
const inputTensorAge = tf.tensor(data.map(d => d.age), [data.length, 1])
// oneHot indices must be int32
const inputTensorState = tf.oneHot(tf.tensor1d(data.map(d => d.state), 'int32'), 50)
const labelTensor = tf.oneHot(tf.tensor1d(data.map(d => d.risk), 'int32'), 2)

// 50 one-hot state columns + 1 age column
const inputDims = 51;
const model = tf.sequential({
  layers: [
    tf.layers.dense({units: 8, inputDim: inputDims, activation: 'relu'}),
    tf.layers.dense({units: 2, activation: 'softmax'}),
  ]
});

model.compile({loss: 'categoricalCrossentropy', optimizer: 'adam', metrics: ['accuracy']});

model.fit(tf.concat([inputTensorState, inputTensorAge], 1), labelTensor, {epochs: 10})

(BTW ... I am new to TensorFlow, so there might be better approaches ... but this has worked for me)

Now ... my challenge: I want a similar model, but now I have a postcode instead of a state (let's say that there are 10000 possible values for the postcode):

const data = [
  {age: 20, postcode: 0, risk: 0},
  {age: 30, postcode: 11, risk: 0},
  {age: 60, postcode: 11, risk: 1},
  {age: 75, postcode: 9876, risk: 1},
  ...
]

If I want to use embeddings to represent the postcode, I understand that I should use an embedding layer such as:

tf.layers.embedding({inputDim:10000, outputDim: 20})

So, if I were using only the postcode as an input and omitting the age, the model would be:

const model = tf.sequential({
  layers: [
    tf.layers.embedding({inputDim: 10000, outputDim: 20}),
    tf.layers.dense({units: 2, activation: 'softmax'}),
  ]
});

If I create the input tensor as

const inputTensorPostcode = tf.tensor(data.map(d => d.postcode));

And try model.fit(inputTensorPostcode, labelTensor, {epochs:10})

It will not work ... so I am obviously doing something wrong.

Any hints on how I should create my model and call model.fit with embeddings?

Also ... if I want to combine multiple inputs (let's say postcode and age), how should I do it?

elaval

1 Answer


For categorical data, one might use a one-hot encoding to solve the problem. The issue with one-hot encoding is that it often leads to sparse data with a lot of zeros.

The other way to deal with categorical data is to reduce the dimensionality of the input data. This technique is known as embeddings. For creating models involving categorical data, one might use the embedding layer offered by the TensorFlow.js layers API.
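For instance, here is a minimal sketch (not a definitive implementation) of an embedding-based model for the postcode example from the question, assuming data and labelTensor are built as shown there. The embedding layer consumes integer indices and outputs a 3D tensor of shape [batch, inputLength, outputDim], so a flatten layer is needed before the final dense layer:

const model = tf.sequential({
  layers: [
    tf.layers.embedding({inputDim: 10000, outputDim: 20, inputLength: 1}),
    // [batch, 1, 20] -> [batch, 20] so the dense layer can consume it
    tf.layers.flatten(),
    tf.layers.dense({units: 2, activation: 'softmax'}),
  ]
});
model.compile({optimizer: 'adam', loss: 'categoricalCrossentropy'});

// one postcode index per example, as int32, shape [numExamples, 1]
const inputTensorPostcode = tf.tensor2d(
  data.map(d => [d.postcode]), [data.length, 1], 'int32');
model.fit(inputTensorPostcode, labelTensor, {epochs: 10});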

Edit: The data here is not really categorical data, though it is possible to model it as such; there is just no reason for doing so. A classical example of categorical data, from recommendation systems, is a dataset recording which movies a user has watched or not. The data will look like the following:

       _________________________________________________
       | movie 1 | movie 2 | movie 3 |  ---  | movie n |
       |_________|_________|_________|_______|_________|
user 1 |    0    |    1    |    1    |  ---  |    0    |
user 2 |    0    |    0    |    1    |  ---  |    0    |
user 3 |    0    |    1    |    0    |  ---  |    0    |
  .    |    .    |    .    |    .    |  ---  |    .    |
  .    |    .    |    .    |    .    |  ---  |    .    |
  .    |    .    |    .    |    .    |  ---  |    .    |

The input dimension here is the number of movies n. Such data can be very sparse, with a lot of zeros: the database may contain hundreds of thousands of movies, while the average user will hardly have watched more than a thousand of them. In that case there will be about a thousand fields with 1 and all the rest with 0. Such data needs to be aggregated using embeddings in order to lower the dimension from n to something smaller.
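As a quick illustration (the sizes below are hypothetical, not part of the question's data), an embedding layer maps each movie index to a small dense vector:

// hypothetical sizes: n movies embedded into k dimensions
const n = 100000, k = 32;
const movieEmbedding = tf.layers.embedding({inputDim: n, outputDim: k});
// two watched movies (indices 1 and 2) for a single user
const watched = tf.tensor2d([[1, 2]], [1, 2], 'int32');
// output shape [1, 2, 32]: one k-dimensional vector per movie
movieEmbedding.apply(watched).print();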

That is not the case here. The input data has only two features, age and postcode, so the input dimension is 2. The output (label) is one-dimensional (the risk property); since there are two risk categories, its one-hot encoding has a size of 2. The range of values of postcode does not affect the classification:

const data = [
  {age: 20, state: 0, risk: 0},
  {age: 30, state: 35, risk: 0},
  {age: 60, state: 35, risk: 1},
  {age: 75, state: 17, risk: 1}
]

const model = tf.sequential()
model.add(tf.layers.dense({inputShape: [2], units: 10, activation: 'relu'}))
model.add(tf.layers.dense({activation: 'softmax', units: 2}))
const x = tf.tensor2d(data.map(e => [e.age, e.state]), [data.length, 2])
const y = tf.oneHot(tf.tensor1d(data.map(e => e.risk), "int32"), 2)

model.compile({optimizer: 'adam', loss: 'categoricalCrossentropy' })
model.fit(x, y, {epochs: 10}).then(() => {
  // a prediction will look like [p, 1-p] with 0 <= p <= 1
  // predictions [p, 1-p] such that p > 0.5 are in the first category
  // predictions [p, 1-p] such that 1-p > 0.5 are in the second category
  // the prediction for age 30 and state 35 is the same as for age 0 and state 35
  // (they will both have either p > 0.5 or p < 0.5)
  // the prediction will be different for age 75 and state 17
  model.predict(tf.tensor2d([[30, 35], [0, 35], [75, 17]])).print()
})
<html>
  <head>
    <!-- Load TensorFlow.js -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@0.13.0"> </script>
  </head>

  <body>
  </body>
</html>
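Regarding the last part of the question (combining age with an embedded postcode), the sequential API is not enough for multiple inputs; a sketch using the functional API (tf.model) could look like the following, with the layer sizes as assumptions:

// two separate inputs: a continuous age and an integer postcode index
const ageInput = tf.input({shape: [1]});
const postcodeInput = tf.input({shape: [1], dtype: 'int32'});

// embed the postcode and flatten [batch, 1, 20] into [batch, 20]
const embedded = tf.layers.flatten().apply(
  tf.layers.embedding({inputDim: 10000, outputDim: 20}).apply(postcodeInput));

// concatenate both branches into a single [batch, 21] feature vector
const features = tf.layers.concatenate().apply([ageInput, embedded]);
const output = tf.layers.dense({units: 2, activation: 'softmax'}).apply(features);

const model = tf.model({inputs: [ageInput, postcodeInput], outputs: output});
model.compile({optimizer: 'adam', loss: 'categoricalCrossentropy'});
// fit takes one tensor per declared input:
// model.fit([inputTensorAge, inputTensorPostcode], labelTensor, {epochs: 10})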
edkeveked
  • Thanks for pointing me in this direction! I was not aware of embeddings ... and it seems to be what I needed. If you know any tutorial or concrete example that shows actual implementations of embeddings in TensorFlow.js, I would appreciate it. – elaval Oct 22 '18 at 01:42
  • I don't have any implementation of the embeddings. But you can write your own model with an embedding layer in the middle if you understand how it works – edkeveked Oct 22 '18 at 17:46
  • Thanks @edkeveked. I have tried to add an embedding layer in the middle, but I am afraid that I am missing something. I edited the original question, adding a very simple case where I would like to use embeddings. If you can have a look at it I would appreciate it – elaval Oct 23 '18 at 17:57
  • I edited the answer with more explanation. Let me know if that matches your understanding – edkeveked Oct 24 '18 at 03:59
  • I really appreciate the answer!! In your example, we are treating age and state in a similar way (as a continuous integer variable). I thought that this was OK for "age", but that "state" should be treated as a "nominal" variable and therefore needed a oneHot representation; this is where I thought that for a variable like postcode we needed an alternative to oneHot. So ... do you recommend using an integer value to represent the state? (tf.tensor2d(data.map(e => [e.age, e.state]), [data.length, 2])). Would you do the same with postcodes (which could be 10000 instead of 50)? Thanks – elaval Oct 24 '18 at 12:45
  • Treating the postcode feature as a continuous integer variable is intuitive. Doing otherwise and considering it as a nominal variable is possible, but counterintuitive, I would say. As for the 0-10000 range of the `postcode` feature, it is not an issue. What you can do, if you see fit, is to normalize the data using the mean and the standard deviation (see the sketch below). But the direct and simple approach is already enough – edkeveked Oct 24 '18 at 17:52
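For reference, the normalization mentioned in the last comment could look like the following sketch, assuming x is the [numExamples, 2] input tensor from the answer's snippet:

// standardize each feature column: subtract the mean and divide by the
// standard deviation (a small epsilon avoids division by zero)
const {mean, variance} = tf.moments(x, 0);
const xNormalized = x.sub(mean).div(variance.sqrt().add(1e-7));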