I'm working on a large machine learning/NLP project and I'm stuck on a small part of it. (PM me if you want to know exactly what I'm working on.)
I'm trying to write a program in JavaScript that learns to generate valid words, starting from nothing but the letters of the alphabet.
What I have is a database of 500K different words. It's a big JS object, structured like this (the words are German):
database = {
"um": {id: 1, word: "um", freq: 10938},
"oder": {id: 2, word: "oder", freq: 10257},
"Er": {id: 3, word: "Er", freq: 9323},
...
}
"freq" is the word's frequency. (This value may become important later, but I don't use it at the moment, so just ignore it.)
The way my program currently works: in the first iteration, it generates a completely random word between 2 and 13 letters long and looks it up in the database. If the word is there, every letter in it gets a good rating; if not, every letter gets a bad rating. The word length is rated the same way: a good rating if the word is valid, a bad rating if it is not.
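To make the rating step concrete, here is a minimal sketch of it. The names `randomWord`, `rateWord` and `ratings` are made up for illustration, the alphabet is assumed to be lowercase a to z, and "good" and "bad" are assumed to mean +1 and -1 (with a repeated letter rated once per occurrence); my real code may differ in these details.

```javascript
// Assumed alphabet and length bounds (2 to 13 letters, as described above).
const ALPHABET = "abcdefghijklmnopqrstuvwxyz";
const MIN_LEN = 2;
const MAX_LEN = 13;

// Uniform random integer in [min, max].
function randomInt(min, max) {
  return min + Math.floor(Math.random() * (max - min + 1));
}

// First-iteration generator: uniformly random length and letters.
function randomWord() {
  const length = randomInt(MIN_LEN, MAX_LEN);
  let word = "";
  for (let i = 0; i < length; i++) {
    word += ALPHABET[randomInt(0, ALPHABET.length - 1)];
  }
  return word;
}

// Rate every letter of the word and its length:
// +1 if the word is in the database, -1 otherwise (assumed values).
function rateWord(word, database, ratings) {
  const isValid = Object.prototype.hasOwnProperty.call(database, word);
  const delta = isValid ? 1 : -1;
  for (const letter of word) {
    ratings.letters[letter] = (ratings.letters[letter] || 0) + delta;
  }
  ratings.lengths[word.length] = (ratings.lengths[word.length] || 0) + delta;
  return isValid;
}
```

Usage would look like `rateWord(randomWord(), database, { letters: {}, lengths: {} })`, keeping the same `ratings` object across iterations.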
In every iteration after the first, it no longer picks the letters and the word length uniformly at random. Instead, it samples them using probabilities based on their ratings.
For example, let's say it found the words "the", "so" and "if" in the first 100 iterations. Then the letters "t", "h", "e", "s", "o", "i" and "f" are rated well, and so are the word lengths 2 and 3. The word generated in the next iteration is therefore more likely to contain these well-rated letters than badly rated ones.
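The rating-based generation can be sketched roughly like this. `weightedPick`, `toWeights` and `generateWord` are hypothetical names, and mapping a rating to a weight with `Math.exp(rating * 0.1)` is only an assumption to keep weights positive when ratings go negative; it is one of several possible choices, not necessarily what my program does.

```javascript
// Pick one key from `weights` with probability proportional to its weight.
// `rand` can be replaced to make the function deterministic in tests.
function weightedPick(weights, rand = Math.random) {
  const entries = Object.entries(weights);
  const total = entries.reduce((sum, [, w]) => sum + w, 0);
  let threshold = rand() * total;
  for (const [key, w] of entries) {
    threshold -= w;
    if (threshold < 0) return key;
  }
  return entries[entries.length - 1][0]; // guard against rounding errors
}

// Turn raw ratings (possibly negative) into strictly positive weights.
// The exponential mapping and the 0.1 factor are assumptions.
function toWeights(ratings, keys) {
  const weights = {};
  for (const key of keys) {
    weights[key] = Math.exp((ratings[key] || 0) * 0.1);
  }
  return weights;
}

// Sample a length first, then sample each letter independently.
function generateWord(ratings, alphabet, lengths) {
  const length = Number(weightedPick(toWeights(ratings.lengths, lengths)));
  const letterWeights = toWeights(ratings.letters, [...alphabet]);
  let word = "";
  for (let i = 0; i < length; i++) {
    word += weightedPick(letterWeights);
  }
  return word;
}
```

Note that in this sketch each letter is drawn independently of its position and of its neighbours, which matches the per-letter ratings described above.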
Of course, the program also checks whether the current word has been generated before. If so, the word isn't rated again and the program generates a new one.
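The duplicate check is roughly equivalent to the following sketch. `nextUnseenWord` and the `maxAttempts` limit are made-up names for illustration; `generate` stands for whatever function produces the next candidate word.

```javascript
// Keep generating until a word turns up that has not been seen before.
// Returns null if every attempt produced an already-seen word.
function nextUnseenWord(generate, seen, maxAttempts = 1000) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const word = generate();
    if (!seen.has(word)) {
      seen.add(word); // remember it so it is never rated twice
      return word;
    }
  }
  return null;
}

// Usage: one Set shared across all iterations.
// const seen = new Set();
// const word = nextUnseenWord(() => "um", seen);
```

The attempt limit is there because a strongly biased generator can keep producing words that are already in `seen`.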
In theory, it should learn the optimal letter frequencies and the optimal word-length distribution on its own and eventually generate only valid words.
Of course, this doesn't work. It improves over the first few iterations, but as soon as it has found all the two-letter words, it gets worse. I think my whole approach is wrong. I've actually tried it out and have a (not so beautiful) graph after 5000 iterations for you:
Red line: wrong words generated
Green line: right words generated
So what is the problem here? Am I doing machine learning wrong? And do you have a solution, such as a suitable algorithm or a trie-based approach?
PS: I'm aware of this, but it's not in JS, I don't understand it and I can't comment on it.