I understand that ANN input must be normalized, standardized, etc. Leaving the peculiarities and models of the various ANNs aside, how can I preprocess UTF-8 encoded text into the range [0, 1], or alternatively [-1, 1], before it is given as input to a neural network? I have been searching for this on Google but can't find any information (I may be using the wrong terms). A rough sketch of what I have in mind follows the questions below.
- Does that make sense?
- Isn't that how text is preprocessed for neural networks?
- Are there any alternatives?
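To make the question concrete, here is a minimal sketch of the naive byte-level scaling I have in mind (the function name, fixed length and zero-padding are placeholder choices of mine, not an established method):

```python
import numpy as np

def text_to_unit_interval(text, max_len=128):
    """Naive preprocessing: UTF-8 bytes scaled into [0, 1]."""
    raw = text.encode("utf-8")[:max_len]
    x = np.frombuffer(raw, dtype=np.uint8).astype(np.float32) / 255.0  # now in [0, 1]
    x = np.pad(x, (0, max_len - len(x)))  # zero-pad to a fixed input length
    return x

x01 = text_to_unit_interval("Hello, world!")
x11 = 2.0 * x01 - 1.0                     # rescale to [-1, 1] if preferred
print(x01.shape, x01.min(), x01.max())
```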
Update on November 2013
I had long accepted Pete's answer as correct. However, I now have serious doubts, mostly due to recent research I've been doing on Symbolic Knowledge and ANNs.
Dario Floreano and Claudio Mattiussi explain in their book that such processing is indeed possible, by using distributed encoding.
Indeed, a Google Scholar search turns up a plethora of neuroscience articles and papers on how distributed encoding is hypothesized to be used by the brain to encode Symbolic Knowledge.
Teuvo Kohonen, in his paper "Self-Organizing Maps", explains:
One might think that applying the neural adaptation laws to a symbol set (regarded as a set of vectorial variables) might create a topographic map that displays the "logical distances" between the symbols. However, there occurs a problem which lies in the different nature of symbols as compared with continuous data. For the latter, similarity always shows up in a natural way, as the metric differences between their continuous encodings. This is no longer true for discrete, symbolic items, such as words, for which no metric has been defined. It is in the very nature of a symbol that its meaning is dissociated from its encoding.
However, Kohonen did manage to deal with Symbolic Information in SOMs!
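As a very rough illustration of the distributed-encoding idea (my own toy sketch, not the scheme from Floreano and Mattiussi's book or from Kohonen's work): each symbol is assigned a fixed dense code vector instead of a bare integer or one-hot code, so that a metric between encoded symbols exists and the result can be fed to an ANN or SOM.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical symbol set and code dimensionality, purely for illustration.
symbols = ["cat", "dog", "run", "sleep"]
dim = 16

# Each symbol gets a fixed, dense, real-valued code vector in [-1, 1].
# Unlike a one-hot (local) code, every component carries part of the encoding,
# and a metric (Euclidean, cosine, ...) is now defined between symbols.
codes = {s: rng.uniform(-1.0, 1.0, dim) for s in symbols}

def encode(sentence):
    """Encode a sequence of symbols as one flat vector suitable as ANN input."""
    return np.concatenate([codes[s] for s in sentence])

x = encode(["cat", "run"])   # shape (32,), values in [-1, 1]
print(x.shape, x.min(), x.max())
```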
Furthermore, Prof. Dr. Alfred Ultsch, in his paper "The Integration of Neural Networks with Symbolic Knowledge Processing", deals exactly with how to process Symbolic Knowledge (such as text) in ANNs. Ultsch offers the following methodologies for processing Symbolic Knowledge: Neural Approximative Reasoning, Neural Unification, Introspection and Integrated Knowledge Acquisition. That said, little information can be found on these in Google Scholar, or anywhere else for that matter.
Pete is right about semantics in his answer: semantics in ANNs are usually disconnected from the encoding. However, the references below provide insight into how researchers have used RBMs, trained to recognize similarity in the semantics of different word inputs. So it shouldn't be impossible to capture semantics, but it would require a layered approach, or a secondary ANN, if semantics are needed:
- Natural Language Processing With Subsymbolic Neural Networks, Risto Miikkulainen, 1997
- Training Restricted Boltzmann Machines on Word Observations, G. E. Dahl, R. P. Adams, H. Larochelle, 2012
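To give a feel for the layered idea, here is a toy binary RBM trained with one step of contrastive divergence (CD-1) on one-hot word windows. This is only my own minimal illustration with made-up data; the papers above use far more elaborate models and training schemes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny made-up corpus and vocabulary, purely for illustration.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
word_to_idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)
n_hidden = 8

def one_hot(word):
    v = np.zeros(V)
    v[word_to_idx[word]] = 1.0
    return v

# Visible units: concatenation of two consecutive words (a 2-word window).
windows = np.array([np.concatenate([one_hot(a), one_hot(b)])
                    for a, b in zip(corpus[:-1], corpus[1:])])
n_visible = windows.shape[1]

# RBM parameters.
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v = np.zeros(n_visible)
b_h = np.zeros(n_hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.1
for epoch in range(200):
    # Positive phase: hidden activations given the data.
    h_prob = sigmoid(windows @ W + b_h)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Negative phase: one step of Gibbs sampling (CD-1).
    v_prob = sigmoid(h_sample @ W.T + b_v)
    h_prob_neg = sigmoid(v_prob @ W + b_h)
    # Parameter updates.
    W += lr * (windows.T @ h_prob - v_prob.T @ h_prob_neg) / len(windows)
    b_v += lr * (windows - v_prob).mean(axis=0)
    b_h += lr * (h_prob - h_prob_neg).mean(axis=0)

# Hidden activations act as a crude learned feature vector for a word;
# words used in similar contexts should end up with similar activations.
def features(word):
    v = np.concatenate([one_hot(word), np.zeros(V)])
    return sigmoid(v @ W + b_h)

print(features("cat"))
print(features("dog"))
```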
Update on January 2021
The fields of NLP and Deep Learning have seen a resurgence of research in the years since I asked this question. There are now machine-learning models that address what I was trying to achieve in many different ways.
For anyone arriving at this question and wondering how to pre-process text for Deep Learning or Neural Networks, here are a few helpful topics, none of which are academic, but which are simple to understand and should get you started on solving similar tasks:
- Vector Space Models
- Transformers
- Recurrent and Convolutional Networks for Text Classification
- Word Embedding
- Text Pre-processing
At the time I asked this question, RNNs, CNNs and VSMs were only just starting to be used; nowadays most Deep Learning frameworks offer extensive support for word embeddings, and a minimal sketch of such a pipeline follows below. Hope the above helps.
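For illustration, here is a minimal sketch of the kind of pipeline a modern framework makes easy: tokenize, map words to integer IDs, pad to a fixed length, and let a trainable embedding layer produce the dense input vectors. Shown here with PyTorch and a made-up toy vocabulary:

```python
import torch
import torch.nn as nn

# Toy corpus; real pipelines build the vocabulary from the training data.
sentences = ["the cat sat on the mat", "the dog sat on the rug"]
vocab = {"<pad>": 0, "<unk>": 1}
for s in sentences:
    for w in s.split():
        vocab.setdefault(w, len(vocab))

def encode(sentence, max_len=8):
    """Map words to integer IDs and pad to a fixed length."""
    ids = [vocab.get(w, vocab["<unk>"]) for w in sentence.split()][:max_len]
    ids += [vocab["<pad>"]] * (max_len - len(ids))
    return torch.tensor(ids)

batch = torch.stack([encode(s) for s in sentences])  # shape (2, 8), integer IDs

# The embedding layer turns integer word IDs into dense, trainable vectors;
# these replace the hand-crafted [0, 1] scaling I was originally asking about.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=32, padding_idx=0)
vectors = embedding(batch)                            # shape (2, 8, 32)
print(vectors.shape)
```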
Update January 2023
After the recent announcements of ChatGPT, Large Language Models, etc., and since NLP has blown up out of proportion, addressing my original question (processing a string of text) at the character level is now possible. The real question is why you would want to do that, which is an entirely different topic. For some information on how CNNs, RNNs, Transformers and other models can achieve that, see this blog post here, which explains how character embeddings can be used. Similarly, other sources explain this in more detail, such as: