
I'm training an LSTM on a spam dataset with two classes: "spam" and "ham". I preprocess the data by splitting each message into characters and one-hot encoding the characters. Then I pair each message with its label vector: [0] for "ham" and [1] for "spam". This code preprocesses the data:

const fs = require("fs");
const R = require("ramda");
const txt = fs.readFileSync("spam.txt").toString();
const encodeChars = string => {
    const vecLength = 127;
    const genVec = (char) => R.update(char.charCodeAt(0), 1, Array(vecLength).fill(0));
    return string.split('').map(char => char.charCodeAt(0) < vecLength ? genVec(char) : "invalid");
}
const data = R.pipe(
    R.split(",,,"),
    R.map(
        R.pipe(
            x => [(x.split(",").slice(1).concat("")).reduce((t, v) => t.concat(v)), x.split(",")[0]],
            R.adjust(1, R.replace(/\r|\n/g, "")),
            R.adjust(0, encodeChars),
            R.adjust(1, x => x === "ham" ? [0] : [1])
        )
    ),
    R.filter(R.pipe(
        R.prop(0),
        x => !R.contains("invalid", x)
    ))
)(txt);
fs.writeFileSync("data.json", JSON.stringify(data))

Then, using the encoded vectors from data.json, I load the data into TensorFlow:

const fs = require("fs");
const data = JSON.parse(fs.readFileSync("data.json").toString()).sort(() => Math.random() - 0.5)
const train = data.slice(0, Math.floor(data.length * 0.8));
const test = data.slice(Math.floor(data.length * 0.8));
const tf = require("@tensorflow/tfjs-node");
const model = tf.sequential({
    layers: [
        tf.layers.lstm({ inputShape: [null, 127], units: 16, activation: "relu", returnSequences: true }),
        tf.layers.lstm({ units: 16, activation: "relu", returnSequences: true }),
        tf.layers.lstm({ units: 16, activation: "relu", returnSequences: true }),
        tf.layers.dense({ units: 1, activation: "softmax" }),
    ]
})
const tdata = tf.tensor3d(train.map(x => x[0]));
const tlabels = tf.tensor2d(train.map(x => x[1]));
model.compile({
    optimizer: "adam",
    loss: "categoricalCrossentropy",
    metrics: ["accuracy"]
})
model.fit(tdata, tlabels, {
    epochs: 1,
    batchSize: 32,
    callbacks: {
        onBatchEnd(batch, logs) {
            console.log(logs.acc)
        }
    }
})

tdata is 3-dimensional and tlabels is 2-dimensional, so everything should work fine. However, when I run the code, I get the following error:

Error when checking target: expected dense_Dense1 to have 3 dimension(s). but got array with shape 4032,1

Does anyone know what went wrong here? I couldn't figure it out. Thanks!

Note: I already tried normalizing the lengths by padding the message vectors with "null" at the end so they all have a standardized length. I still got the same error.

N8Javascript

1 Answer


The last LSTM layer should set returnSequences: false, so that it outputs only the final timestep (a 2D tensor) instead of the whole sequence (a 3D tensor). This matches the shape of your 2D labels and fixes the error in the question:

Error when checking target: expected dense_Dense1 to have 3 dimension(s). but got array with shape 4032,1
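A minimal sketch of the fix, keeping the question's shapes. Note that the final activation is also switched here from softmax to sigmoid with binaryCrossentropy, since softmax over a single unit always outputs 1; this is an extra change beyond the returnSequences fix:

```javascript
const tf = require("@tensorflow/tfjs-node");

const model = tf.sequential({
    layers: [
        tf.layers.lstm({ inputShape: [null, 127], units: 16, returnSequences: true }),
        tf.layers.lstm({ units: 16, returnSequences: true }),
        // returnSequences: false collapses the time dimension, so this layer
        // outputs [batch, 16] instead of [batch, timesteps, 16]
        tf.layers.lstm({ units: 16, returnSequences: false }),
        // sigmoid on a single unit gives a spam probability in [0, 1]
        tf.layers.dense({ units: 1, activation: "sigmoid" }),
    ]
});
model.compile({
    optimizer: "adam",
    loss: "binaryCrossentropy",
    metrics: ["accuracy"]
});
```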

To elaborate further, there is more to this than fixing the shapes. Instead of encoding each character, the dataset should be tokenized. A simple word tokenizer can be used as explained here, or the tokenizer that comes with the universal sentence encoder. The LSTM sequence can then be made of the unique identifier of each token.
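As a rough illustration of the idea (not the universal sentence encoder's tokenizer), a word-level tokenizer just builds a vocabulary and maps each message to a sequence of token ids:

```javascript
// Build a word -> id vocabulary from the training messages.
// Id 0 is reserved for padding / unknown words.
const buildVocab = messages => {
    const vocab = new Map();
    for (const msg of messages) {
        for (const word of msg.toLowerCase().split(/\s+/).filter(Boolean)) {
            if (!vocab.has(word)) vocab.set(word, vocab.size + 1);
        }
    }
    return vocab;
};

// Encode one message as a sequence of token ids.
const encode = (vocab, msg) =>
    msg.toLowerCase().split(/\s+/).filter(Boolean).map(w => vocab.get(w) || 0);

const messages = ["free prize now", "see you now"];
const vocab = buildVocab(messages);
console.log(encode(vocab, "free prize now")); // [1, 2, 3]
```

These id sequences (padded to a common length) would feed an embedding layer in front of the LSTM, instead of the 127-wide one-hot character vectors.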

Additionally, using a single unit in the last layer does not reflect a classification approach; it is more as if we are predicting a value than a class. Two units (one for spam, the other for ham) should be used, together with one-hot encoded labels.
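Concretely, the labels in the question's preprocessing step would become two-element one-hot vectors, paired with a two-unit softmax output layer (e.g. tf.layers.dense({ units: 2, activation: "softmax" }) and categoricalCrossentropy):

```javascript
// One-hot labels for a two-class output: [ham, spam]
const toOneHot = label => label === "ham" ? [1, 0] : [0, 1];

console.log(toOneHot("ham"));  // [1, 0]
console.log(toOneHot("spam")); // [0, 1]
```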

edkeveked