0

I am trying to generate readable word-like random strings not found in any dictionaries using Markov Chain.

I have pulled a large amount of data of ngram frequencies from a total of 105230 words pulled from GCIDE, and currently these data are stored in Counter format (serialized as JSON), and utilizing Markov chain involves randomly choose elements from a set with weights.

I have already found a way to do weighted random sample, like this:

random.choices(keys, weights=values, k=1)

(keys and values are pulled from the Counter)

But all the tutorials I have found are implementing Markov chains using numpy, and to use this method I need to convert the integers into permillages of the total and ensure the numbers add up to 1.0.

As I said I want the numbers in permillage format (float with three decimal places) and the floats must add up to 1.0 in order to make the numpy method work.

I can convert the numbers into floats but due to precision limits inherent to the 53-bit double precision floating point format the numbers won't always add up to 1.0.

For example:

initcon = {'c': 7282,
 'm': 6015,
 'd': 5866,
 'p': 5699,
 's': 5294,
 'b': 4103,
 'r': 4097,
 'h': 3926,
 'l': 3352,
 't': 2841,
 'f': 2699,
 'n': 2171,
 'g': 2051,
 'pr': 1991,
 'v': 1626,
 'tr': 1337,
 'w': 1337,
 'st': 1153,
 'ch': 1121,
 'cr': 827,
 'br': 803,
 'j': 799,
 'sp': 746,
 'gr': 694,
 'k': 676,
 'ph': 651,
 'pl': 645,
 'fl': 622,
 'th': 594,
 'sh': 572,
 'q': 553,
 'cl': 538,
 'fr': 522,
 'sc': 516,
 'bl': 494,
 'gl': 428,
 'dr': 421,
 'z': 376,
 'wh': 338,
 'str': 335,
 'sl': 325,
 'sw': 245,
 'rh': 210,
 'sk': 167,
 'sn': 165,
 'scr': 148,
 'sm': 143,
 'x': 143,
 'chr': 141,
 'kn': 139,
 'thr': 125,
 'sq': 124,
 'ps': 123,
 'wr': 113,
 'sch': 106,
 'tw': 95,
 'spr': 73,
 'spl': 72,
 'shr': 66,
 'sph': 65,
 'chl': 54,
 'pt': 51,
 'gn': 49,
 'phl': 41,
 'scl': 39,
 'gh': 37,
 'pn': 37,
 'phr': 33,
 'kr': 30,
 'kl': 22,
 'dw': 16,
 'kh': 15}

total = sum(initcon.values())

initcon = {k: v/total for k, v in initcon.items()}
print(sum(initcon.values()))

It prints 0.9999999999999999.

How can I make the numbers in initcon add up to exactly 1.0 and make them each have exactly 3 decimal places?

Ξένη Γήινος
  • 2,181
  • 1
  • 9
  • 35
  • if you want them to round to 3 decimal places you can just do `round(value, 3)` – ConnerWithAnE Oct 14 '21 at 06:06
  • “Permillage” is a rare word that should be defined when used, and the numbers you are using are not permillages. The sum of the permillages of a partition add to 1000, not 1. E.g., for pieces that are a a tenth, a quarter, and 65% of a whole, the permillages are 100, 250, and 650, with a total of 1000. – Eric Postpischil Oct 14 '21 at 10:50
  • Re “How can I make the numbers in initcon add up to exactly 1.0 and make them each have exactly 3 decimal places?”: That is impossible in general when using binary-based floating-point. The only numbers between 0 and 1 with exactly three decimal places are 0, .125, .250, .375, .500, .625, .750, .875, and 1. For example, there is no binary-based floating-point number whose value is .123. – Eric Postpischil Oct 14 '21 at 10:56
  • Aside from the question I marked this as a duplicate of, I recall one more duplicate that got a fair amount of discussion, but I cannot find it at the moment. Maybe somebody else can. – Eric Postpischil Oct 14 '21 at 11:04

1 Answers1

0

I had a similar problem at work as well. The only way we found to maintain the sum equal to 1 was to add an additional step.

After

initcon = {k: v/total for k, v in initcon.items()}

you can check what is the remainder so that sum(initcon.values())==1 i.e.

remainder = 1-sum(initcon.values())

Then you can add this number to any of the keys. You can pick it randomly with random.choice method. So, in the end, you will do something like:

initcon[random_key]+=remainder
Python learner
  • 1,159
  • 1
  • 8
  • 20
tzinie
  • 717
  • 5
  • 9