How do you create words which are not part of the English language, but sound English? For example: janertice, bellagom
-
Why would you need to do this? – dacracot Dec 11 '09 at 22:54
-
A Markov chain built from a database of English syllables seems like a reasonable approach. What have you tried so far? – Eric Lippert Dec 11 '09 at 22:56
-
I don't know but I've seen some other users on here that seem to have mastered this art. – Mark Byers Dec 11 '09 at 22:56
-
The Daily WTF covered this quite well in the article titled "The Automated Curse Generator": http://thedailywtf.com/articles/the-automated-curse-generator.aspx – Michael La Voie Dec 11 '09 at 22:57
-
I like how this got five responses mentioning Markov chains within the span of a couple minutes... – C. A. McCann Dec 11 '09 at 22:58
-
Duplicate: http://stackoverflow.com/questions/594273/random-word-generator – Robert Harvey Dec 11 '09 at 22:58
-
Not a dupe. The other one is for selecting a random word out of a grab bag; this is for actually making a random word. – RCIX Dec 11 '09 at 23:06
13 Answers
Consider this algorithm, which is really just a degenerate case of a Markov chain.

Take the start of one English word and the end of another and concatenate.
E.g.
Fortune + totality = fortality
You might want to add some more rules like only cutting your words on consonant-vowel boundaries and so on.
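A minimal sketch of the splice idea, assuming "consonant-vowel boundary" means a position where a consonant is immediately followed by a vowel. The names `cut_points` and `splice` are made up for illustration, and treating `y` as a consonant is a simplification.

```python
import random

VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant


def cut_points(word):
    """Return indices i where word[i-1] is a consonant and word[i] is a vowel."""
    return [i for i in range(1, len(word))
            if word[i - 1] not in VOWELS and word[i] in VOWELS]


def splice(first, second, rng=random):
    """Join the start of `first` to the end of `second`.

    Both words are cut at a consonant-vowel boundary, so the result keeps
    a consonant from `first` followed by a vowel from `second` at the seam.
    Returns None if either word has no such boundary.
    """
    a = cut_points(first)
    b = cut_points(second)
    if not a or not b:
        return None
    i = rng.choice(a)
    j = rng.choice(b)
    return first[:i] + second[j:]


if __name__ == "__main__":
    print(splice("fortune", "totality"))
```

With these rules, cutting "fortune" after "fort" and "totality" before "ality" is one of the possible outcomes, which reproduces the "fortality" example above.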

-
I agree. People rearrange prefixes/infixes/suffixes all the time subconsciously to create new English words. It's an exceptionally simple algorithm (heuristic?) in the mind, so it wouldn't be hard to implement in code. I'm happy to contribute to this post's upvotedness =) – Jacobs Data Solutions Jan 15 '10 at 15:11
-
Here's an example of somebody doing it. They talk about Markov chains and dissociated press.
Here's some code I found. You can run it online at codepad.
import random

vowels = ["a", "e", "i", "o", "u"]
consonants = ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q',
              'r', 's', 't', 'v', 'w', 'x', 'y', 'z']

def _vowel():
    return random.choice(vowels)

def _consonant():
    return random.choice(consonants)

def _cv():
    # consonant + vowel, e.g. "ba"
    return _consonant() + _vowel()

def _cvc():
    # consonant + vowel + consonant, e.g. "bat"
    return _cv() + _consonant()

def _syllable():
    # pick one of the three syllable shapes at random and build it
    return random.choice([_vowel, _cv, _cvc])()

def create_fake_word():
    """ This function generates a fake word by creating between two and three
        random syllables and then joining them together.
    """
    syllables = []
    for _ in range(random.randint(2, 3)):
        syllables.append(_syllable())
    return "".join(syllables)

if __name__ == "__main__":
    print(create_fake_word())

You might be interested in How do I determine if a random string sounds like English?

I think this story will answer your question quite nicely.
It describes the development of a Markov chain algorithm, including the pitfalls that come up.

One approach that's relatively easy and effective is to run a Markov chain generator per-character instead of per-word, using a large corpus of English words as source material.

Note: Linguistics is a hobby, but I am in no way an expert at it.
First you need to get a "dictionary", so to speak, of English phonemes.
Then you simply string them together.
While not the most complex or accurate solution, it should lead you to a generally acceptable outcome.
It is also far simpler to implement if you don't understand the complexities of the other solutions mentioned.

Using Markov chains is an easy way, as already pointed out. Just be careful that you don't end up with an Automated Curse Generator.

Use n-grams built from an English corpus with n > 3; that gets you an approximation.

I can't think of any cromulent ways of doing this.

-
;-) This kind of humorous tidbit is most welcome on SO. (It helps us keep up with otherwise terse material and also stops us from taking ourselves too seriously.) That said, lines like this should be placed as a comment on the question, not as an answer! Thanks. – mjv Dec 14 '09 at 05:26
A common practice is to build a Markov chain based on the letter transitions in a "training set" made of several words (nouns?) from an English lexicon, and then to let this chain produce "random" words for you.

Markov chain is the way to go, as others have already posted. Here is an overview of the algorithm:
- Let H be a dictionary that maps each letter to another dictionary, which in turn maps each possible next letter to the number of times it follows the first.
- Initialize H by scanning through a corpus of text (for example, the Bible, or the Stack Overflow public data). This is a simple frequency count. An example entry might be H['t'] = {'t': 23, 'h': 300, 'a': 50}. Also create a special "start" symbol indicating the beginning of a word, and an "end" symbol for the end.
- Generate a word by starting with the "start" symbol, and then randomly picking a next letter based on the frequency counts. Generate each additional letter based on the last letter. For example, if the last letter is 't', then you will pick 'h' with probability 300/373, 't' with probability 23/373, and 'a' with probability 50/373. Stop when you hit the "end" symbol.
To make your algorithm more accurate, instead of mapping one letter to the next letters, you could map two letters to the next letter.

If you decide to go with a simple approach like the code Andy West suggested, you might get even better results by weighting the frequencies of vowels and consonants to correspond with those occurring normally in the English language: Wikipedia: Letter Frequency
You could even go as far as looking at the frequencies of paired letters or sequences of three letters, but at that point you're actually implementing the same idea as the Markov chain others have suggested. Is it more important that the "fake words" look potentially authentic to humans, or are the statistical properties of the words more important, such as in cryptographic applications?
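As a rough sketch of the weighting idea, the two helpers below could stand in for the `_vowel()` and `_consonant()` functions in the syllable code above. The frequency table holds approximate, rounded percentages for English text and should be treated as an assumption; the function names are made up for illustration.

```python
import random

# Approximate letter frequencies in English text, in percent (rounded;
# treat the exact values as assumptions).
LETTER_FREQ = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7, "s": 6.3,
    "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8, "u": 2.8, "m": 2.4,
    "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0, "p": 1.9, "b": 1.5, "v": 1.0,
    "k": 0.8, "j": 0.15, "x": 0.15, "q": 0.1, "z": 0.07,
}
VOWELS = ["a", "e", "i", "o", "u"]
CONSONANTS = [c for c in LETTER_FREQ if c not in VOWELS]


def weighted_letter(letters, rng=random):
    """Pick one letter from `letters`, weighted by its English frequency."""
    weights = [LETTER_FREQ[c] for c in letters]
    return rng.choices(letters, weights=weights)[0]


def weighted_vowel(rng=random):
    return weighted_letter(VOWELS, rng)


def weighted_consonant(rng=random):
    return weighted_letter(CONSONANTS, rng)


if __name__ == "__main__":
    # One consonant-vowel-consonant syllable with weighted letters.
    print(weighted_consonant() + weighted_vowel() + weighted_consonant())
```

Rare letters such as `q`, `x` and `z` still appear, just far less often than with a uniform choice.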
