
Every example I've seen for Encog neural nets has involved XOR or something very simple. I have around 10,000 sentences, and each word in a sentence has some type of tag. The input layer needs to take 2 inputs: the previous word and the current word. If there is no previous word, then the 1st input is not activated at all. I need to go through each sentence like this. Each word is contingent on the previous word, so I can't just have an array that looks similar to the XOR example. Furthermore, I don't really want to load all the words from 10,000+ sentences into an array; I'd rather scan one sentence at a time and, once I reach EOF, start back at the beginning.

How should I go about doing this? I'm not super comfortable with Encog because all the examples I've seen have either been XOR or extremely complicated.

There are 2 inputs, and each input consists of 30 neurons. The probability of the word being each possible tag is used as the input values, so most of the neurons get 0 and the others get probabilities like .5, .3, and .2. When I say 'aren't activated' I just mean that all the neurons are set to 0. The output layer represents all the possible tags, so it's 30 neurons. Whichever output neuron has the highest value is the tag that is chosen.

I'm not sure how to go through all 10,000 sentences and look up each word in each sentence (for the inputs, and activate that input) in the 'demos' of Encog that I've seen.

It seems that the networks are trained with a single array holding all training data, which is looped through until the network is trained. I would like to train the network with many different arrays (one array per sentence) and then loop through them all again.

This format is clearly not going to work for what I'm doing:

    do {
        train.iteration();
        System.out.println("Epoch #" + epoch + " Error:" + train.getError());
        epoch++;
    } while (train.getError() > 0.01);
  • What neural net? FeedForward? If yes, use Elman networks instead, because they naturally carry past context in their hidden layer, which you are trying to cram into the input artificially like in a Time-Delay Network. Beware though that Encog still has no proper BPTT afaik. – runDOSrun Jul 01 '15 at 09:44

2 Answers


So, I'm not sure how to tell you this, but that's not how a neural net works. You can't just use a word as an input, and you can't just "not activate" an input either. At a very basic level, this is what you need to run a neural network on a problem:

  1. A fixed-length input vector (whatever you are feeding in, it must be represented numerically with a fixed length. Each entry in the vector is a single number)
  2. A set of labels (each input vector must correspond to a single, fixed-length output vector)

Once you have those two, the neural net classifies an example, then edits itself to get as close as possible to the labels.

If you're looking to work with words and a deep learning framework, you should map your words to an existing vector representation (I would highly recommend GloVe, but word2vec is decent as well) and then learn on top of that representation.
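For what it's worth, the pre-trained GloVe files are plain text, one word per line followed by its vector components, so mapping words to vectors is just a small parsing job. A minimal sketch (the file path passed to `load` is a placeholder for whichever GloVe file you download):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class GloveLoader {
        // Reads a GloVe text file ("word v1 v2 ... vn" per line) into a lookup map.
        public static Map<String, double[]> load(String path) throws IOException {
            Map<String, double[]> vectors = new HashMap<>();
            try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split(" ");
                    double[] vector = new double[parts.length - 1];
                    for (int i = 1; i < parts.length; i++) {
                        vector[i - 1] = Double.parseDouble(parts[i]);
                    }
                    vectors.put(parts[0], vector);
                }
            }
            return vectors;
        }
    }

You would then feed these vectors (instead of, or alongside, your tag probabilities) into whatever model you train on top.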


After having a deeper understanding of what you're attempting here, I think the issue is that you're dealing with 60 inputs, not one. These inputs are the concatenation of the existing predictions for both words (in the case with no previous word, the first 30 entries are 0). You should take care of the mapping yourself (it should be very straightforward), and then just treat it as trying to predict 30 numbers from 60 numbers.
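To make that concrete, here is a rough sketch of building one training example in Encog terms; probabilitiesFor() and the tag index are stand-ins for whatever lookup tables you already have:

    import org.encog.ml.data.MLDataPair;
    import org.encog.ml.data.basic.BasicMLData;
    import org.encog.ml.data.basic.BasicMLDataPair;

    public class WordPairEncoder {
        // Builds one training example from the previous and current word.
        // probabilitiesFor() is a placeholder for however you look up a word's
        // 30 tag probabilities; pass null for "no previous word" to get all zeros.
        public static MLDataPair encode(String previousWord, String currentWord, int correctTagIndex) {
            double[] input = new double[60];
            System.arraycopy(probabilitiesFor(previousWord), 0, input, 0, 30);
            System.arraycopy(probabilitiesFor(currentWord), 0, input, 30, 30);

            double[] ideal = new double[30];
            ideal[correctTagIndex] = 1.0; // one-hot encoding of the expected tag

            return new BasicMLDataPair(new BasicMLData(input), new BasicMLData(ideal));
        }

        // Placeholder lookup table.
        private static double[] probabilitiesFor(String word) {
            if (word == null) {
                return new double[30]; // no previous word: all zeros
            }
            return new double[30];     // replace with your per-word probability table
        }
    }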

I feel obliged to tell you that, the way you've framed the problem, you will see awful performance. When dealing with a sparse (mostly zeros) vector and such a small dataset, deep learning techniques will show VERY poor performance compared to other methods. You are better off using GloVe + an SVM, or a random forest model, on your existing data.

  • Okay, my apologies for over-simplifying my question... There are 2 inputs... Each input consists of 30 neurons. The chance of the word being a certain tag is used as inputs. So, most of the neurons get 0, the others get probability inputs like .5, .3, and .2. When I say 'aren't activated' I just mean that all the neurons are set to 0. The output layer represents all the possible tags, so it's 30. Whichever output neuron has the highest number is the tag that is chosen. – Nate Cook3 Jun 30 '15 at 21:00
  • So you're trying to make a single more accurate prediction based on two predictions? – Slater Victoroff Jun 30 '15 at 21:02
  • (I'm not sure how to go through all 10,000 sentences and look-up each word in each sentence (for the inputs and activate that input) in the 'demos' of Encog that I've seen.) – Nate Cook3 Jun 30 '15 at 21:03
  • @NateCook3 I'm just making sure that we're on the same page here. – Slater Victoroff Jun 30 '15 at 21:03
  • I want to stick with a neural network for parts of speech tagging. I was planning on also trying to look at 3 words, the previous word, the word to be tagged, and the following word. Would this help the problem, or will there still be too many 0s? I'm not exactly sure what you mean by mapping the words to an existing vector representation. – Nate Cook3 Jun 30 '15 at 21:19
  • @NateCook3 it's not just about context here (though that would probably help). Under the hood your model is not dealing with a word, but with a list of numbers. The short of it is that the typical method of mapping words to lists of numbers is very poorly suited for neural networks. There are other mappings you can use (GloVe and word2vec) that will give you much better performance. Does that make sense? – Slater Victoroff Jun 30 '15 at 21:26
  • Okay, I think that makes sense. My word representation was the probability of which part-of-speech that word had been tagged as in a training set. Most of the time a word had the potential to be tagged as one of 2-5 different parts of speech. You're saying that using the probabilities of word W being tags t1, t2, t3, ... tx is not a good way to represent what that word actually is? I've read several different papers that use this strategy and most of them get recognition rates in the mid 90s. – Nate Cook3 Jun 30 '15 at 21:32

You can use other implementations of MLDataSet besides BasicMLDataSet.

I ran into a similar problem with windows of DNA sequences. Building an array of all the windows would not have been scalable.

Instead, I implemented my own VersatileDataSource, and wrapped it in a VersatileMLDataSet.

VersatileDataSource has just a few methods to implement:

    public interface VersatileDataSource {
        String[] readLine();
        void rewind();
        int columnIndex(String name);
    }

For each readLine(), you could return the inputs for the previous/current word, and advance the position to the next word.
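For example, a minimal sketch of such a source for the sentence/tag setup in the question might look like the following. The sentence list and the tagProbabilities()/lookupTag() helpers are placeholders for your own data, not part of Encog, and the 60-inputs-plus-one-tag row layout is just one way to arrange the columns:

    import java.util.List;

    import org.encog.ml.data.versatile.sources.VersatileDataSource;

    // Walks a list of tagged sentences word by word, emitting one row per word.
    // Each row is 60 input columns (the previous word's 30 tag probabilities,
    // then the current word's 30) plus one output column for the expected tag.
    public class SentenceDataSource implements VersatileDataSource {

        private final List<String[]> sentences; // each sentence = an array of words
        private int sentenceIndex = 0;
        private int wordIndex = 0;

        public SentenceDataSource(List<String[]> sentences) {
            this.sentences = sentences;
        }

        @Override
        public String[] readLine() {
            if (sentenceIndex >= sentences.size()) {
                return null; // end of data; rewind() resets for the next pass
            }
            String[] sentence = sentences.get(sentenceIndex);
            String previous = (wordIndex == 0) ? null : sentence[wordIndex - 1];
            String current = sentence[wordIndex];

            String[] row = new String[61];
            fill(row, 0, previous);  // all "0.0" when there is no previous word
            fill(row, 30, current);
            row[60] = lookupTag(sentenceIndex, wordIndex); // expected tag from your training data

            // Advance to the next word, moving on to the next sentence when needed.
            wordIndex++;
            if (wordIndex >= sentence.length) {
                wordIndex = 0;
                sentenceIndex++;
            }
            return row;
        }

        @Override
        public void rewind() {
            sentenceIndex = 0;
            wordIndex = 0;
        }

        @Override
        public int columnIndex(String name) {
            return -1; // no named columns
        }

        // Copy a word's 30 tag probabilities into the row as strings.
        private void fill(String[] row, int offset, String word) {
            double[] probs = (word == null) ? new double[30] : tagProbabilities(word);
            for (int i = 0; i < 30; i++) {
                row[offset + i] = Double.toString(probs[i]);
            }
        }

        // Placeholders for your own lookup tables.
        private double[] tagProbabilities(String word) { return new double[30]; }
        private String lookupTag(int sentence, int position) { return "NOUN"; }
    }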

  • Ignoring the X in an XY problem :/ – Slater Victoroff Jun 30 '15 at 21:10
  • @SlaterTyranus - I think addressing the OP's actual question requires some knowledge of Encog. You can find a manual here: https://s3.amazonaws.com/heatonresearch-books/free/encog-3_3-quickstart.pdf. – Andy Thomas Jun 30 '15 at 21:14
  • I understand Encog; that doesn't change the fact that this is an XY problem. – Slater Victoroff Jun 30 '15 at 21:20
  • @AndyThomas so, the training code in the original question would need to change as well? – jonbon Jun 30 '15 at 21:27
  • @jonbon - I'm not sure I see why you're suggesting that. The `MLTrain` interface used in the training code in the original question would be unchanged. And most of the encog APIs that take a dataset accept an `MLDataSet`, and a `VersatileMLDataSet` implements that interface. – Andy Thomas Jun 30 '15 at 21:35
  • @AndyThomas okay, I think I was just confused because the interface doesn't look complete. Were you adding that interface inside the jar file and editing it, etc.? – jonbon Jun 30 '15 at 21:46
  • @jonbon - It's a nicely minimal interface. Instances get wrapped up in a VersatileMLDataSet, which can usually be used in place of a BasicMLDataSet. – Andy Thomas Jun 30 '15 at 21:53
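To make that last comment concrete, here is a rough sketch of the wiring, loosely following the quick start guide linked above. The column names are made up, and depending on your Encog version you may need to set a normalization strategy (e.g. via the EncogModel workflow in the guide) before calling normalize():

    import org.encog.ml.data.versatile.VersatileMLDataSet;
    import org.encog.ml.data.versatile.columns.ColumnDefinition;
    import org.encog.ml.data.versatile.columns.ColumnType;
    import org.encog.ml.data.versatile.sources.VersatileDataSource;

    public class DataSetWiring {
        public static VersatileMLDataSet wrap(VersatileDataSource source) {
            VersatileMLDataSet data = new VersatileMLDataSet(source);

            // 60 continuous input columns (two 30-entry probability vectors)...
            for (int i = 0; i < 60; i++) {
                data.defineSourceColumn("p" + i, i, ColumnType.continuous);
            }
            // ...and one nominal output column holding the expected tag.
            ColumnDefinition tag = data.defineSourceColumn("tag", 60, ColumnType.nominal);

            data.analyze();                          // scan the source once to gather statistics
            data.defineSingleOutputOthersInput(tag); // tag is the output, everything else input
            data.normalize();                        // may need a normalization strategy set first

            return data;
        }
    }

Because VersatileMLDataSet implements MLDataSet, the returned data set can be handed to the same MLTrain setup used in the question's do/while loop.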