How to extract real words from a code that generates a random set of letters

Question

I want to find out the average number of real words that would show up in a set of randomly generated letters. Is there a Pythonic way to do this?

I've managed to figure out how to generate a set of 1000 random letters 1000 times, but I have no idea how to go about counting the number of real words efficiently.

This is what I have so far

import random
import string


def text_gen(size=100, chars=string.ascii_uppercase + string.ascii_lowercase):
    """Return a string of `size` characters drawn at random from `chars`."""
    return ''.join(random.choice(chars) for _ in range(size))


# Generate and print 1000 random strings of 1000 letters each.
for _ in range(1000):
    print(text_gen(1000))

From the string generated, how would I be able to filter out only the parts that make sense?

For example if the text generated is gdlkfghiwmfefirekjfewlklphonelkfdlfk, it would result in: ``` Words Generated: Fire Phone Total Words: 2 ``` — E - Dongonbe, Mar 26 '19 at 07:52
Yeah, all the words generated + total count of the words, then from there i'll figure out how to count numbers of lengths of the words generated — E - Dongonbe, Mar 26 '19 at 07:55
How do you define a "real word"? That is for you to decide and put into code — zvone, Mar 26 '19 at 07:57
Anything in the english dictionary i guess? [Here is a .txt with over 450k english words](https://raw.githubusercontent.com/dwyl/english-words/master/words.txt) — E - Dongonbe, Mar 26 '19 at 07:58
In that case you would need a "list" of all words in the english dictionary and then look for substrings in your 1000-letter-string which match an element from your dictionary list - the challenge I see here is to create a "list" of all words of the english language. — Cribber, Mar 26 '19 at 08:00
Why use [Monte Carlo](https://en.wikipedia.org/wiki/Monte_Carlo_method) when you can use [Combinatorics](https://en.wikipedia.org/wiki/Combinatorics)? — Peter Wood, Mar 26 '19 at 08:02
There are also APIs which you could use - https://www.dictionaryapi.com/ or https://dictionary-api.cambridge.org/ - depending on your use they might require a fee though — Cribber, Mar 26 '19 at 08:04
`phone` contains `one` and `on`, do you want to count all those words? — Aemyl, Mar 26 '19 at 08:05
Is there a way I could use the raw .txt to create a list somehow? — E - Dongonbe, Mar 26 '19 at 08:06

Peter Wood · Accepted Answer · 2019-03-27T16:17:25.140

1

You can take a different route: divide the number of words by the number of possible combinations.

From a dictionary make a set of words for a given length, e.g. 6 letters:

with open('words.txt') as words:
    six_letters = {word for word in words.read().splitlines()
                   if len(word) == 6}

The number of six-letter words is len(six_letters).

The number of combinations of six lowercase letters is 26 ** 6.

So the probability of getting a valid six letter word is:

len(six_letters) / 26 ** 6

Edit: Python 2 uses floor division for integers, so this will give you 0.

You can convert either the numerator or denominator to a float to get a non-zero result, e.g.:

len(six_letters) / 26.0 ** 6

Or you can make your Python 2 code behave like Python 3 by importing from the future:

from __future__ import division

len(six_letters) / 26 ** 6

Which, with your word list, both give us:

9.67059707562e-05

The number of 4-letter words is 7185. There's a nice tool for collecting histogram data in the standard library, collections.Counter:

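from collections import Counter
from pprint import pprint

# Path to the word list, one word per line.
words_file = 'words.txt'

with open(words_file) as words:
    # Count how many words there are of each length.
    counter = Counter(len(word.strip()) for word in words)

# Sort by word length so the histogram prints in order.
pprint(sorted(counter.items()))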

The values from your file give:

[(1, 26),
 (2, 427),
 (3, 2130),
 (4, 7185),
 (5, 15918),
 (6, 29874),
 (7, 41997),
 (8, 51626),
 (9, 53402),
 (10, 45872),
 (11, 37538),
 (12, 29126),
 (13, 20944),
 (14, 14148),
 (15, 8846),
 (16, 5182),
 (17, 2967),
 (18, 1471),
 (19, 760),
 (20, 359),
 (21, 168),
 (22, 74),
 (23, 31),
 (24, 12),
 (25, 8),
 (27, 3),
 (28, 2),
 (29, 2),
 (31, 1)]

So the most common word length in your dictionary is 9 letters, with 53402 words. There are roughly twice as many 5-letter words as 4-letter words, and twice as many 6-letter words as 5-letter words.
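To get from these per-length counts to the average asked about in the question, one option is linearity of expectation: a string of n letters has n - k + 1 positions where a k-letter word could start, and each position holds a given k-letter word with probability 1 / 26 ** k. The sketch below sums that over all lengths. It is only an illustration under some assumptions: letters are independent and uniform over 26 letters (case is ignored), the word list contains only plain a-z words, and overlapping or nested words such as `one` inside `phone` are all counted. The function name `expected_word_count` and its parameters are made up for this example.

```python
def expected_word_count(length_counts, text_length=1000, alphabet_size=26):
    """Expected number of dictionary words appearing as substrings of a
    random string, counting overlapping and nested occurrences.

    length_counts maps word length -> number of dictionary words of that length.
    """
    expected = 0.0
    for word_length, n_words in length_counts.items():
        # Number of start positions where a word of this length fits.
        positions = text_length - word_length + 1
        if positions <= 0:
            continue
        # Chance that a given position holds *some* word of this length.
        p_word = n_words / alphabet_size ** word_length
        expected += positions * p_word
    return expected


# A few of the counts from the histogram above, as an example input.
length_counts = {4: 7185, 5: 15918, 6: 29874}
print(expected_word_count(length_counts))
```

Passing `dict(counter)` from the histogram code instead of the hand-written dictionary would cover every word length at once.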

edited Mar 27 '19 at 16:17

answered Mar 26 '19 at 08:11

Peter Wood


I think this might be close enough, let me run it a couple of times and i'll show you results. – E - Dongonbe Mar 26 '19 at 08:15
@E-Dongonbe: Suggestion: you could use this to store your entire list of English words in a list-of-sets, sorted by length. i.e the list at `[i]` a set of all the words of length `i+1`, etc. The length of the list would tell you the longest possible word length and the associated set would make checking for membership very fast. – martineau Mar 26 '19 at 09:08
@Peter Wood i've tried to run it a few times and i've changed the number of letters. it does not seem to run correctly, could my [source .txt](https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt) be flawed? the words are separated by lines. – E - Dongonbe Mar 26 '19 at 10:49
@E-Dongonbe if you're using Python 2 then it uses [floor division](https://stackoverflow.com/questions/183853/what-is-the-difference-between-and-when-used-for-division). Dividing a smaller integer by a larger integer will give `0`. I'll update the answer. – Peter Wood Mar 26 '19 at 13:10
@PeterWood sorry for the late response. I'm using python 3, when i try to find the number of words it shows me 7185 for four letters words, i think it should be a lot more than that. Thanks for explaining floor division tho! – E - Dongonbe Mar 27 '19 at 06:20
@E-Dongonbe I checked the words and `7185` it seems reasonable. I'll update the question. – Peter Wood Mar 27 '19 at 07:53
@PeterWood Oh, well I guess I didn't expect these results. Frankly, I thought 4 or 5 letter words would be the most common in the sample. Thank you very much for your help, I learned a lot working on this. – E - Dongonbe Mar 27 '19 at 11:48
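martineau's list-of-sets suggestion from the comments could look roughly like the sketch below. The helper name `group_by_length` is made up for illustration, and the tiny word list stands in for the real file.

```python
def group_by_length(words):
    """Return a list where index i holds the set of all words of length i + 1."""
    words = [word for word in words if word]  # skip empty entries
    longest = max(map(len, words), default=0)
    by_length = [set() for _ in range(longest)]
    for word in words:
        by_length[len(word) - 1].add(word)
    return by_length


by_length = group_by_length(['fire', 'phone', 'one', 'on'])
print(len(by_length))          # longest word length: 5
print('fire' in by_length[3])  # True: 'fire' is in the set of 4-letter words
```

With the real word list, `len(by_length)` gives the longest possible word length, and each membership test is a single set lookup.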

score 0 · Answer 2 · answered Mar 26 '19 at 08:09

It is up to you to define what real words are, so create your own list of words. I wrote the following solution using the random string from your comment:

dictionary = ['fire', 'phone']
random_string = 'gdlkfghiwmfefirekjfewlklphonelkfdlfk'
total_words = 0
for word in dictionary:
    total_words += random_string.count(word)
print(total_words)

>>> 2

Which can be refactored into the following code where you create a list with the count of each word in your dictionary and then get a sum of all these counts:

dictionary = ['fire', 'phone']
random_string = 'gdlkfghiwmfefirekjfewlklphonelkfdlfk'
total_words = sum([random_string.count(word) for word in dictionary]) # List comprehension to create a list, then sum the content of the list
print(total_words)

>>> 2
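With a dictionary of several hundred thousand words, looping over every word and scanning the string each time may get slow, and `str.count` only counts non-overlapping occurrences of each word. One possible alternative, sketched below, is to put the words in a set and test every substring up to the longest word length for membership. The function name `find_words` and the `min_len` parameter are made up for this example, and the sketch assumes that case should be ignored and that nested words (such as `one` and `on` inside `phone`) should all be reported.

```python
def find_words(text, words, min_len=1):
    """Return every dictionary word found as a substring of text,
    in order of start position, including overlapping and nested words."""
    text = text.lower()
    word_set = {word.lower() for word in words if len(word) >= min_len}
    max_len = max(map(len, word_set), default=0)
    found = []
    for start in range(len(text)):
        # Only substrings up to the longest dictionary word can match.
        last_end = min(start + max_len, len(text))
        for end in range(start + min_len, last_end + 1):
            candidate = text[start:end]
            if candidate in word_set:
                found.append(candidate)
    return found


random_string = 'gdlkfghiwmfefirekjfewlklphonelkfdlfk'
found = find_words(random_string, ['fire', 'phone', 'one', 'on'])
print(found)       # ['fire', 'phone', 'on', 'one']
print(len(found))  # 4
```

Raising `min_len` (for example to 3) is one way to ignore the very short words, which otherwise dominate the count in random text.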

score 0 · Answer 3 · answered Mar 26 '19 at 09:03

You could check each candidate word with a request to https://developer.oxforddictionaries.com/. They have an API which may be useful for your purposes, and they also have a basic Python example using requests. Or you could use another API, for example the Google Translate API, and check for error returns (I personally have not used any of them and I do not know what they return for a misspelled word, but it should not be hard to find out).

Last but not least, you could use requests and Beautiful Soup to send requests to a dictionary page and read the results. (Requesting Google Translate would be best, but it will block you after a few results.)
