You can take a different route: divide the number of words by the number of possible combinations.
From a dictionary, make a set of all words of a given length, e.g. 6 letters:
    with open('words.txt') as words:
        six_letters = {word for word in words.read().splitlines()
                       if len(word) == 6}
The number of six-letter words is len(six_letters).
The number of combinations of six lowercase letters is 26 ** 6.
So the probability of getting a valid six-letter word is:

    len(six_letters) / 26 ** 6
edit: Python 2 uses floor division for integers, so this will give you 0.
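The pitfall is easy to demonstrate; in Python 3 the same floor division is spelled `//`:

```python
# Integer (floor) division: a smaller int over a larger int gives 0
print(7 // 26)    # 0
# One float operand forces true division
print(7 / 26.0)   # 0.2692...
```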
You can convert either the numerator or the denominator to a float to get a non-zero result, e.g.:

    len(six_letters) / 26.0 ** 6
Or you can make your Python 2 code behave like Python 3 by importing division from the future:

    from __future__ import division
    len(six_letters) / 26 ** 6
Either way, with your word list, this gives:

    9.67059707562e-05
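Put together, the whole calculation is short; a minimal sketch, using a small in-memory list as a stand-in for words.txt:

```python
# Hypothetical stand-in for the contents of words.txt
sample_words = ["python", "snakes", "cat", "letter", "words", "puzzle"]

six_letters = {word for word in sample_words if len(word) == 6}

# Probability that a random six-letter lowercase string is a valid word
probability = len(six_letters) / float(26 ** 6)
print(len(six_letters), probability)
```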
The number of 4-letter words is 7185. There's a nice tool for collecting histogram data in the standard library, collections.Counter:
    from collections import Counter
    from pprint import pprint

    with open('words.txt') as words:
        counter = Counter(len(word.strip()) for word in words)
    pprint(counter.items())
The values from your file give:
[(1, 26),
(2, 427),
(3, 2130),
(4, 7185),
(5, 15918),
(6, 29874),
(7, 41997),
(8, 51626),
(9, 53402),
(10, 45872),
(11, 37538),
(12, 29126),
(13, 20944),
(14, 14148),
(15, 8846),
(16, 5182),
(17, 2967),
(18, 1471),
(19, 760),
(20, 359),
(21, 168),
(22, 74),
(23, 31),
(24, 12),
(25, 8),
(27, 3),
(28, 2),
(29, 2),
(31, 1)]
So most words in your dictionary, 53402 of them, have 9 letters. There are roughly twice as many 5-letter words as 4-letter words, and twice as many 6-letter words as 5-letter words.
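Counter can also pick out the most frequent length directly with its most_common() method; a minimal sketch on a toy word list (a hypothetical stand-in for the real dictionary):

```python
from collections import Counter

# Toy word list; lengths are 3, 3, 2, 5
lengths = Counter(len(w) for w in ["cat", "dog", "ox", "horse"])
print(lengths.most_common(1))  # [(3, 2)] -- length 3 occurs twice
```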