
I'm working on a big machine learning/NLP project and I'm stuck on a small part of it. (PM me if you want to know what I'm working on exactly.)

I'm trying to code a program in JavaScript that learns to generate valid words, using nothing but the letters of the alphabet.

What I have is a database of 500K different words. It's a big JS object, structured like this (the words are German):

database = {
    "um": {id: 1, word: "um", freq: 10938},
    "oder": {id: 2, word: "oder", freq: 10257},
    "Er": {id: 3, word: "Er", freq: 9323},
    ...
}

"freq" means frequency obviously. (Maybe this value sometimes gets important but I currently don't use it, so just ignore it.)


The way my program currently works is: in the first iteration, it generates a completely random word between 2 and 13 letters long and looks it up in the database. If it's there, every letter in the word gets a good rating; if it's not, every letter gets a bad rating. The word length is rated the same way: if the word is valid, its length gets a good rating; if not, a bad one.
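Simplified, the rating step looks roughly like this (a sketch, not my exact code):

const letterRatings = {};   // e.g. {t: 3, h: 1, x: -5, ...}
const lengthRatings = {};   // e.g. {2: 4, 3: 2, 13: -7, ...}

// Valid words push their letters and length up; invalid ones push them down.
function rateWord(word, isValid) {
    const delta = isValid ? 1 : -1;
    for (const ch of word) {
        letterRatings[ch] = (letterRatings[ch] || 0) + delta;
    }
    lengthRatings[word.length] = (lengthRatings[word.length] || 0) + delta;
}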

In the iterations after that first one, it no longer generates a word with random letters and a random word length. Instead, it uses probabilities based on the ratings of the letters and word lengths.

For example, say it found the words "the", "so" and "if" within the first 100 iterations. Then the letters "t", "h", "e", "s", "o", "i" and "f" are rated well, and so are the word lengths 2 and 3. The word generated in the next iteration will therefore be more likely to contain well-rated letters than badly rated ones.

Of course, the program also checks whether the generated word has already been generated before; if so, that word doesn't get rated again and a new one is generated.

In theory, it should learn the optimal letter frequencies and word-length frequencies on its own, and at some point generate only valid words.

Yeah. Of course this doesn't work. It gets better over the first few iterations, but as soon as it has found all the 2-letter words it gets worse. I think my whole approach is wrong. I've actually tried it out and have a (not so beautiful) graph after 5000 iterations for you:

[graph: right vs. wrong words generated over 5000 iterations]

Red line: wrong words generated

Green line: right words generated


Yeah. What is the problem here? Am I doing machine learning wrong? And do you have a solution, maybe some algorithm or a trie-based system?

PS: I'm aware of this, but it's not in JS, I don't understand it, and I can't comment on it.

  • What are examples of "wrong" words? – guest271314 Apr 14 '17 at 04:53
  • words that are not found in the database – Nano Miratus Apr 14 '17 at 05:06
  • Comparatively, how are words that are not found in database different from words that are considered correct? – guest271314 Apr 14 '17 at 05:13
  • Hm, I don't think I understand what you mean... the database is a 500K-word dictionary of real words... and wrong words are those combinations of random letters which don't make up a word in the database – Nano Miratus Apr 14 '17 at 07:45
  • How closely related to the characters of the word in database is machine? For example "hi", "hl" or "hello", "helo". There is only "right" or "wrong" words? Have you tried implementing fuzzy logic to interpret a range of possible similar characters which could be represented as value for "right" letter? For machine output to be interpreted similarly to modern languages. In equivocal languages new terms are created constantly. There are languages which are not equivocal; still others which are not "written" as "words". What is logic for machine to "learn" "right" response? – guest271314 Apr 14 '17 at 08:08
  • You are learning a "character level language model". Search for it, and find what other people have done. – user3639557 Apr 14 '17 at 09:18

1 Answer


An alternative method would be to use a Markov Model.

Start by counting up the letter frequencies and also word length frequencies in your dictionary. Then, to create a word:

  1. Pick a weighted random number (see below) between 1 and the maximum existing word length. That's how many letters you're going to generate.
  2. For each letter in the word, pick a weighted random letter and add it to the word.

That's an order-0 Markov model. It's based on the frequency of letters that occur in the corpus. It will probably give you results that are similar to the system you have.
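To make that concrete, here's a rough sketch of the order-0 version in JavaScript, assuming a database object shaped like the one in your question (pickWeighted is a simple linear scan; a faster variant is sketched under the weighted-random explanation below):

// Count letter and word-length frequencies in the dictionary.
const letterFreq = {};
const lengthFreq = {};
for (const key in database) {
    const word = database[key].word.toLowerCase();
    lengthFreq[word.length] = (lengthFreq[word.length] || 0) + 1;
    for (const ch of word) {
        letterFreq[ch] = (letterFreq[ch] || 0) + 1;
    }
}

// Pick a key from a {key: count} table with probability proportional to count.
function pickWeighted(freq) {
    let total = 0;
    for (const key in freq) total += freq[key];
    let r = Math.random() * total;
    for (const key in freq) {
        r -= freq[key];
        if (r <= 0) return key;
    }
}

// Order-0 generation: every letter is drawn independently.
function generateOrder0() {
    const length = Number(pickWeighted(lengthFreq));
    let word = "";
    for (let i = 0; i < length; i++) word += pickWeighted(letterFreq);
    return word;
}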

You'll get better results from an order-1 Markov model, where instead of computing single-letter frequencies, you compute bigram (two-letter sequence) frequencies. So to pick the first letter, you choose only from the bigrams that begin words. For subsequent letters, you choose a letter that follows the previously generated letter. That's going to give you somewhat better results than an order-0 model.
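A sketch of the order-1 variant, reusing pickWeighted and lengthFreq from the order-0 sketch above (the "^" start-of-word marker is just a convention I picked for this sketch):

// Count bigram frequencies: next[prev][ch] = how often ch follows prev.
// "^" marks the start of a word, so next["^"] holds first-letter counts.
const next = {};
for (const key in database) {
    const word = database[key].word.toLowerCase();
    let prev = "^";
    for (const ch of word) {
        next[prev] = next[prev] || {};
        next[prev][ch] = (next[prev][ch] || 0) + 1;
        prev = ch;
    }
}

// Order-1 generation: each letter depends on the letter before it.
function generateOrder1() {
    const length = Number(pickWeighted(lengthFreq));
    let word = "";
    let prev = "^";
    for (let i = 0; i < length; i++) {
        const options = next[prev];
        if (!options) break; // no successor observed for this letter
        const ch = pickWeighted(options);
        word += ch;
        prev = ch;
    }
    return word;
}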

An order-2 model is surprisingly effective. See my blog post, Shakespeare vs. Markov, for an example.
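The order-2 version is the same idea with a two-letter state; a compact sketch, again reusing the helpers from above:

// Count trigram frequencies, keyed by the two preceding letters.
const next2 = {};
for (const key in database) {
    const word = database[key].word.toLowerCase();
    let state = "^^"; // two start markers
    for (const ch of word) {
        next2[state] = next2[state] || {};
        next2[state][ch] = (next2[state][ch] || 0) + 1;
        state = state.slice(1) + ch;
    }
}

// Order-2 generation: each letter depends on the two letters before it.
function generateOrder2() {
    const length = Number(pickWeighted(lengthFreq));
    let word = "";
    let state = "^^";
    for (let i = 0; i < length; i++) {
        const options = next2[state];
        if (!options) break; // no successor observed for this state
        const ch = pickWeighted(options);
        word += ch;
        state = state.slice(1) + ch;
    }
    return word;
}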

A weighted random number is a number selected "at random," but skewed to reflect some distribution. In the English language, for example, the letter 'e' occurs approximately 12.7% of the time. 't' occurs 9.06% of the time, etc. See https://en.wikipedia.org/wiki/Letter_frequency. So you'd want your weighted random number generator's output to approximate that distribution. Or, in your case, you'd want it to approximate the distribution in your corpus. See Weighted random numbers for an example of how that's done.
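As an illustration, here's a sketch that builds the cumulative table once and samples with a binary search instead of a linear scan (the letter weights are the English percentages quoted above, purely to show the mechanism):

// Build a cumulative weight table once, then sample in O(log n) per draw.
function makeSampler(weights) {
    const keys = Object.keys(weights);
    const cumulative = [];
    let total = 0;
    for (const key of keys) {
        total += weights[key];
        cumulative.push(total);
    }
    return function sample() {
        const r = Math.random() * total;
        // find the first cumulative value greater than r
        let lo = 0, hi = cumulative.length - 1;
        while (lo < hi) {
            const mid = (lo + hi) >> 1;
            if (cumulative[mid] > r) hi = mid;
            else lo = mid + 1;
        }
        return keys[lo];
    };
}

// 'e' ~12.7%, 't' ~9.06% (figures from above); weights need not sum to 100.
const sampleLetter = makeSampler({ e: 12.7, t: 9.06, a: 8.2 });
console.log(sampleLetter()); // "e" most often, then "t", then "a"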

  • Okay, first of all, thank you! About the weighted letters: as I understand it, I already use this technique, but dynamically, by generating a weighted list based on the occurrences of letters in previously generated right words. But: thank you for the Markov model! I already thought of calculating the letter frequency and word length frequency before actually generating words, but I never knew how I could do this... – Nano Miratus Apr 14 '17 at 14:09
  • I will try the order-2 model for now! Thank you for your research on this topic, great blog! – Nano Miratus Apr 14 '17 at 14:15
  • @NanoMiratus: I think order-2 or order-3 will do what you want. My blog doesn't show an example of order-3 text, but it was primarily recognizable words from the corpus. – Jim Mischel Apr 14 '17 at 14:17