
The only NeuralDataSet objects I've seen in action are for XOR, which is just two small data arrays, and I haven't been able to figure out anything more from the documentation on MLDataSet.

It seems like everything must be loaded at once. I would like to loop through the training data until I reach EOF and count that as one epoch. However, in everything I've seen, all the data must be loaded into a single 2D array from the start. How can I get around this?
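
Roughly, this is the loop I want to reproduce (a sketch of my current, non-Encog code; MyMlp and parse stand in for my own classes):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // 'MyMlp' and 'parse' are stand-ins for my existing (non-Encog) code.
    static void trainOneEpoch(MyMlp mlp, String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) { // read until EOF
                double[] nextInput = parse(line);        // decode one sample
                mlp.backprop(nextInput);                 // update on this one sample
            }
        } // reaching EOF counts as one epoch
    }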

I've read this question, and the answers didn't really help me. Beyond that, I haven't found a similar question asked on here.

jonbon
  • Out of curiosity: why do you want to stream the data, is it a memory / volume question? – Elmar Weber Jul 19 '15 at 14:27
  • @ElmarWeber Because the data is fairly large, and even more importantly, I already have an implementation with another neural network. That implementation backpropagates a single input at a time: I just loop through the whole input file and call mlp.backprop(nextInput) for each item scanned. – jonbon Jul 19 '15 at 15:04

1 Answer


This is possible: you can either use an existing data set implementation that supports streaming operation, or implement your own on top of whatever source you have. Check out the MLDataSet interface, and the SQLNeuralDataSet code as an example implementation. You will have to implement a codec if you have a specific format; for CSV there is an implementation already, though I haven't checked whether it is memory based.
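
For example, if your data is in CSV, something like the following should stream from disk instead of holding everything in memory. This is a rough sketch against Encog 3's buffered data set API (CSVDataCODEC, BinaryDataLoader, BufferedMLDataSet); the file names and column counts are made up, so check the signatures against your version:

    import java.io.File;
    import org.encog.ml.data.buffer.BinaryDataLoader;
    import org.encog.ml.data.buffer.BufferedMLDataSet;
    import org.encog.ml.data.buffer.codec.CSVDataCODEC;
    import org.encog.util.csv.CSVFormat;

    public class StreamingCsvExample {
        public static void main(String[] args) {
            File csvFile = new File("training.csv"); // hypothetical paths
            File binFile = new File("training.egb");

            // One-time conversion: CSV -> Encog's binary .egb format.
            // Here: no header row, 4 input columns, 1 ideal column,
            // no significance column.
            CSVDataCODEC codec = new CSVDataCODEC(
                    csvFile, CSVFormat.ENGLISH, false, 4, 1, false);
            new BinaryDataLoader(codec).external2Binary(binFile);

            // BufferedMLDataSet reads records from disk on demand
            // rather than keeping the whole set in memory.
            BufferedMLDataSet trainingSet = new BufferedMLDataSet(binFile);
            // ... hand trainingSet to a trainer here ...
            trainingSet.close();
        }
    }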

Remember when doing this that your data will be streamed in full for each epoch, and in my experience that is a much bigger bottleneck than the actual computation of the network.
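
With Encog's standard trainers, each call to iteration() is one such full pass. A minimal sketch (ResilientPropagation, the 0.01 error target, and the epoch cap are just placeholders):

    import org.encog.ml.data.MLDataSet;
    import org.encog.neural.networks.BasicNetwork;
    import org.encog.neural.networks.training.propagation.resilient.ResilientPropagation;

    // Each iteration() streams the whole data set once, i.e. one epoch,
    // so a disk-backed set is re-read from its source every epoch.
    static void train(BasicNetwork network, MLDataSet trainingSet) {
        ResilientPropagation train = new ResilientPropagation(network, trainingSet);
        int epoch = 0;
        do {
            train.iteration(); // full pass over the (streamed) data
            epoch++;
            System.out.println("Epoch " + epoch + ", error " + train.getError());
        } while (train.getError() > 0.01 && epoch < 100);
        train.finishTraining();
    }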

Elmar Weber
  • Basically, the data I'm training the network with comes from the Brown Corpus. It's not like the simple XOR examples that I can hard-code. Currently, I have Brown Corpus files where each word and its tags are separated by spaces. The network I was using didn't load the entire corpus into memory; it loaded sentence by sentence and trained (backpropagated) one word at a time. Does that make sense? I'm trying another network because I think there may be a bug in the one I'm currently using. – jonbon Jul 19 '15 at 15:07
  • Not sure if I got it right, but the way you are describing it would mean that the way SQLNeuralDataSet is implemented would work, right? You encode the input and output values per word, run the backprop, get the next one, etc. If you really want to go row by row rather than work in batches as the default implementation does, just set the batch size to one (see the sketch below, after these comments). In the end you have two pieces of code: one that turns corpus data into MLData for a single input/output item, and one that feeds this row by row. – Elmar Weber Jul 19 '15 at 18:44
  • Okay, the fact that I'm using a text file and not SQL made me not even consider SQLNeuralDataSet. Thanks, I'll look into it! – jonbon Jul 20 '15 at 15:41
  • There is also a CSVNeuralDataSet that follows the same pattern, but I have not checked whether it implements a row-by-row mode or just loads the whole file into memory. – Elmar Weber Jul 20 '15 at 16:54
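
A rough sketch of the row-by-row idea from the comments above, assuming Encog 3's Backpropagation trainer and its setBatchSize method (check that your version has it); corpusSet is a hypothetical streaming MLDataSet over the encoded corpus, and the learning rate and momentum are placeholder values:

    import org.encog.ml.data.MLDataSet;
    import org.encog.neural.networks.BasicNetwork;
    import org.encog.neural.networks.training.propagation.back.Backpropagation;

    // Online (row-by-row) training: with batch size 1, the weights are
    // updated after every single record rather than once per batch.
    static void trainOnline(BasicNetwork network, MLDataSet corpusSet) {
        Backpropagation train = new Backpropagation(network, corpusSet, 0.7, 0.3);
        train.setBatchSize(1); // update after each word
        for (int epoch = 0; epoch < 10; epoch++) {
            train.iteration(); // one full pass over the corpus
        }
        train.finishTraining();
    }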