How do you create an English-like word?

Question

How do you create words which are not part of the English language, but sound English? For example: janertice, bellagom

A Markov chain built from a database of English syllables seems like a reasonable approach. What have you tried so far? — Eric Lippert, Dec 11 '09 at 22:56
I don't know but I've seen some other users on here that seem to have mastered this art. — Mark Byers, Dec 11 '09 at 22:56
The Daily WTF covered this quite well in the article titled: "The Automated Curse Generator": http://thedailywtf.com/articles/the-automated-curse-generator.aspx — Michael La Voie, Dec 11 '09 at 22:57
I like how this got five responses mentioning Markov chains within the span of a couple minutes... — C. A. McCann, Dec 11 '09 at 22:58
Duplicate: http://stackoverflow.com/questions/594273/random-word-generator — Robert Harvey, Dec 11 '09 at 22:58
Not dupe. The other one is for selecting a random word out of a grab bag, this is for actually making a random word. — RCIX, Dec 11 '09 at 23:06

score 15 · Accepted Answer · answered Dec 11 '09 at 22:55

Consider this algorithm, which is really just a degenerate case of a Markov chain.

JSBձոգչ

score 15 · Answer 2 · answered Dec 11 '09 at 23:28

Take the start of one English word and the end of another and concatenate.

E.g.

Fortune + totality = fortality

You might want to add some more rules like only cutting your words on consonant-vowel boundaries and so on.
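A minimal Python sketch of this splicing idea, assuming a small hand-picked word list (`WORDS` is hypothetical; any English word list would do). It cuts only at consonant-vowel boundaries and rejects results that already appear in the list. The names `cut_points`, `splice`, and `fake_word` are illustrative and do not come from any library.

```python
import random

VOWELS = set("aeiou")
# Hypothetical sample list; substitute a real dictionary file in practice.
WORDS = ["fortune", "totality", "garden", "matter", "velocity", "lantern"]

def cut_points(word):
    """Indices i where word[i-1] is a consonant and word[i] is a vowel."""
    return [i for i in range(1, len(word))
            if word[i - 1] not in VOWELS and word[i] in VOWELS]

def splice(head_word, tail_word, rng=random):
    """Join the start of head_word to the end of tail_word at consonant-vowel cuts.

    Returns None if either word has no such boundary.
    """
    head_cuts = cut_points(head_word)
    tail_cuts = cut_points(tail_word)
    if not head_cuts or not tail_cuts:
        return None
    return head_word[:rng.choice(head_cuts)] + tail_word[rng.choice(tail_cuts):]

def fake_word(words=WORDS, rng=random, max_tries=100):
    """Splice two distinct random words, rejecting anything already in `words`."""
    known = set(words)
    for _ in range(max_tries):
        head, tail = rng.sample(words, 2)
        candidate = splice(head, tail, rng)
        if candidate and candidate not in known:
            return candidate
    return None

if __name__ == "__main__":
    print(fake_word())
```

With the cut after "fort" and before "ality", "fortune" and "totality" give "fortality", as in the example above. Checking against a full dictionary instead of `WORDS` would filter out accidental real words.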

Artelius

I agree. People rearrange prefixes/infixes/suffixes all the time subconsciously to create new English words. It's an exceptionally simple algorithm (heuristic?) in the mind, so it wouldn't be hard to implement in code. I'm happy to contribute to this post's upvotedness =) – Jacobs Data Solutions Jan 15 '10 at 15:11
And then check the dictionary to make sure it's not real. – Tatarize Aug 26 '16 at 21:25

Answer 3 · Andy West · answered Dec 11 '09 at 23:08

Here's an example of somebody doing it. They talk about Markov chains and Dissociated Press.

Here's some code I found. You can run it online at codepad.

import random

vowels = ["a", "e", "i", "o", "u"]
consonants = ["b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "q",
              "r", "s", "t", "v", "w", "x", "y", "z"]

def _vowel():
    return random.choice(vowels)

def _consonant():
    return random.choice(consonants)

def _cv():
    # Consonant followed by a vowel, e.g. "ba".
    return _consonant() + _vowel()

def _cvc():
    # Consonant, vowel, consonant, e.g. "bat".
    return _cv() + _consonant()

def _syllable():
    # Pick one of the three syllable shapes (V, CV, CVC) at random and build it.
    return random.choice([_vowel, _cv, _cvc])()

def create_fake_word():
    """Generate a fake word by joining two or three random syllables."""
    syllables = []
    for _ in range(random.randint(2, 3)):
        syllables.append(_syllable())
    return "".join(syllables)

if __name__ == "__main__":
    # print is a function in Python 3; the original used the Python 2 statement.
    print(create_fake_word())

This post reminds me of Raymond Chen's blog posts (with all the links) ;) – RCIX, Dec 11 '09 at 23:22

score 3 · Answer 4 · edited May 23 '17 at 12:32

You might be interested in "How do I determine if a random string sounds like English?"

answered Dec 11 '09 at 22:54

Chris Fulstow

score 3 · Answer 5 · answered Dec 11 '09 at 23:00

I think this story will answer your question.

It describes the development of a Markov chain algorithm quite nicely, including the pitfalls that come up.

abelenky

score 2 · Answer 6 · answered Dec 11 '09 at 22:56

One approach that's relatively easy and effective is to run a Markov chain generator per-character instead of per-word, using a large corpus of English words as source material.

C. A. McCann

score 2 · Answer 7 · answered Dec 11 '09 at 22:58

Note: Linguistics is a hobby, but I am in no way an expert at it.

First you need to get a "dictionary", so to speak, of English phonemes.

Then you simply string them together.

While not the most complex or accurate solution, it should lead to a generally acceptable outcome. It is also far simpler to implement if you don't understand the complexities of the other solutions mentioned.

score 2 · Answer 8 · answered Dec 11 '09 at 22:59

Using Markov chains is an easy way, as already pointed out. Just be careful that you don't end up with an Automated Curse Generator.

Tim Sylvester

score 2 · Answer 9 · answered Dec 11 '09 at 23:37

Use n-grams built from an English corpus with n > 3; that gets you an approximation.
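One way to read this, sketched in Python under some assumptions: treat each word as a sequence of letters, pad it with a start marker, and record which letter follows every (n-1)-letter context. The marker characters "^" and "$", the function names, and the tiny sample list are all hypothetical; a real run would load a large word list.

```python
import random
from collections import defaultdict

def build_ngram_model(words, n):
    """Map each (n-1)-letter context to the list of letters seen after it."""
    if n < 2:
        raise ValueError("n must be at least 2")
    model = defaultdict(list)
    for word in words:
        # "^" pads the start so the first letters have a context; "$" marks the end.
        padded = "^" * (n - 1) + word.lower() + "$"
        for i in range(len(padded) - n + 1):
            context, nxt = padded[i:i + n - 1], padded[i + n - 1]
            model[context].append(nxt)
    return model

def generate(model, n, rng=random, max_len=12):
    """Walk the model from the start context until "$" or max_len letters."""
    context = "^" * (n - 1)
    letters = []
    while len(letters) < max_len:
        # Repeated entries in the list make frequent letters more likely.
        nxt = rng.choice(model[context])
        if nxt == "$":
            break
        letters.append(nxt)
        context = (context + nxt)[-(n - 1):]
    return "".join(letters)

if __name__ == "__main__":
    # Hypothetical sample corpus; far too small for interesting output.
    sample = ["nation", "station", "ration", "motion", "notion",
              "lotion", "potion", "mention"]
    model = build_ngram_model(sample, 4)
    print(generate(model, 4))
```

The larger n is relative to the corpus, the more often the generator reproduces real training words, so filtering the output against the word list is likely to be needed.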

Paul Nathan

score 2 · Answer 10 · answered Dec 12 '09 at 01:53

I can't think of any cromulent ways of doing this.

Dan Lorenc

;-) These kinds of humorous tidbits are most welcome on SO. (They help us keep up with otherwise terse material and also stop us from taking ourselves too seriously.) That said, this kind of line should be placed as a comment on the question, not as an answer! Thanks. – mjv Dec 14 '09 at 05:26

score 0 · Answer 11 · answered Dec 11 '09 at 22:56

A common practice is to build a Markov chain based on the letter transitions in a "training set" made of several words (nouns?) from an English lexicon, and then to let this chain produce "random" words for you.

mjv

score 0 · Answer 12 · answered Dec 11 '09 at 23:36

A Markov chain is the way to go, as others have already posted. Here is an overview of the algorithm:

1. Let H be a dictionary mapping each letter to another dictionary, which maps each following letter to the number of times it occurs after the first.
2. Initialize H by scanning through a corpus of text (for example, the Bible, or the Stack Overflow public data). This is a simple frequency count. An example entry might be H['t'] = {'t': 23, 'h': 300, 'a': 50}. Also create a special "start" symbol indicating the beginning of a word, and an "end" symbol for the end.
3. Generate a word by starting with the "start" symbol, and then randomly picking a next letter based on the frequency counts. Generate each additional letter based on the last letter. For example, if the last letter is 't', then you will pick 'h' with probability 300/373, 't' with probability 23/373, and 'a' with probability 50/373. Stop when you hit the "end" symbol.
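The steps above can be sketched in Python roughly as follows. This is one possible reading, not a reference implementation: the marker characters "^" and "$" for the start and end symbols, the function names, and the `max_len` safety cap are all assumptions added for illustration.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # assumed markers; any symbols outside the alphabet work

def build_table(words):
    """H[a][b] = number of times symbol b follows symbol a in the corpus."""
    H = defaultdict(Counter)
    for word in words:
        symbols = [START] + list(word.lower()) + [END]
        for prev, nxt in zip(symbols, symbols[1:]):
            H[prev][nxt] += 1
    return H

def next_symbol(H, prev, rng=random):
    """Pick the next symbol with probability proportional to its count."""
    choices = list(H[prev].keys())
    weights = list(H[prev].values())
    return rng.choices(choices, weights=weights, k=1)[0]

def generate_word(H, rng=random, max_len=15):
    """Start at START and follow transitions until END (or max_len letters)."""
    letters = []
    prev = START
    while len(letters) < max_len:
        prev = next_symbol(H, prev, rng)
        if prev == END:
            break
        letters.append(prev)
    return "".join(letters)

if __name__ == "__main__":
    # Hypothetical toy corpus; use a large word list or text in practice.
    H = build_table(["the", "that", "at", "hat", "tea", "heat"])
    print(generate_word(H))
```

With the example entry H['t'] = {'t': 23, 'h': 300, 'a': 50}, `next_symbol` draws 'h' with weight 300 out of a total of 373, matching the probabilities given in step 3.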

To make your algorithm more accurate, instead of mapping one letter to the next letters, you could map two letters to the next letter.

score 0 · Answer 13 · answered Dec 30 '09 at 15:15

If you decide to go with a simple approach like the code Andy West suggested, you might get even better results by weighting the frequencies of vowels and consonants to correspond with those occurring normally in the English language (see Wikipedia: Letter frequency).
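A hedged sketch of that weighting, reusing the V / CV / CVC syllable shapes from Andy West's answer. The weights are approximate, rounded letter frequencies for English text and should be treated as assumptions; the helper names are illustrative.

```python
import random

# Approximate relative frequencies (percent) of letters in English text,
# rounded from commonly published tables; the exact values are assumptions.
VOWEL_WEIGHTS = {"a": 8.2, "e": 12.7, "i": 7.0, "o": 7.5, "u": 2.8}
CONSONANT_WEIGHTS = {
    "b": 1.5, "c": 2.8, "d": 4.3, "f": 2.2, "g": 2.0, "h": 6.1, "j": 0.15,
    "k": 0.77, "l": 4.0, "m": 2.4, "n": 6.7, "p": 1.9, "q": 0.095, "r": 6.0,
    "s": 6.3, "t": 9.1, "v": 0.98, "w": 2.4, "x": 0.15, "y": 2.0, "z": 0.074,
}

def weighted_letter(weights, rng=random):
    """Draw one letter with probability proportional to its weight."""
    letters = list(weights)
    return rng.choices(letters, weights=[weights[c] for c in letters], k=1)[0]

def syllable(rng=random):
    """Build a V, CV, or CVC syllable from weighted letter draws."""
    shape = rng.choice(["V", "CV", "CVC"])
    return "".join(
        weighted_letter(VOWEL_WEIGHTS if kind == "V" else CONSONANT_WEIGHTS, rng)
        for kind in shape
    )

def create_fake_word(rng=random):
    """Join two or three weighted syllables into a fake word."""
    return "".join(syllable(rng) for _ in range(rng.randint(2, 3)))

if __name__ == "__main__":
    print(create_fake_word())
```

Rare letters such as "q", "x", and "z" now appear far less often than with uniform `random.choice`, which is the main visible effect of the weighting.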

You could even go as far as looking at the frequencies of paired letters or sequences of three letters, but at that point you're actually implementing the same idea as the Markov chain others have suggested. Is it more important that the "fake words" look potentially authentic to humans, or are the statistical properties of the words more important, such as in cryptographic applications?
