2

My question is not language specific. I'm having issues with getting the loop to process permutations. I'm attempting to code something to display all values for 26^x where x is the length of a string. No input string will be supplied so if x=1, it'll display a through z, if x=2 itll display aa through zz. az is seen as different from za.

More specifically, I'm wanting to run this for longer strings, 100+ characters in length in an attempt to see how many strings of a given length containing words as opposed to random letters.

Tim
  • 35,413
  • 11
  • 95
  • 121
Rickie Marsh
  • 53
  • 1
  • 7
  • 4
    Time complexity and number of words is n!, for 100 characters is 9 * 10^157. Any algorithm will take a LONG time just to make the words much less process them. – Jesus Ramos Apr 13 '12 at 05:07
  • 1
    (From what i understand) You can calculate the number of words for a length that your program would produce. Use a dictionary library to count the number of words of given length. Now, you can see number of words with random letter. – Priyank Bhatnagar Apr 13 '12 at 06:44
  • @JesusRamos You can toss a fair coin 1000001 times and simulating it will take 2^1000001 steps, but takes almost no time to predict if 'Heads' won or lost! – ElKamina Apr 13 '12 at 06:52
  • [Here I have a permuatation algorithm](http://stackoverflow.com/a/10117424/312172) which can be used with an iterator without generating the whole List of permutations, but one at a time, and an interface to call it with a random number for instance. – user unknown Apr 13 '12 at 07:56
  • @ElKamina I know that but in this case he wants to compare the output possibly with a list of english words (which is probably not that fast, even with hashing) – Jesus Ramos Apr 13 '12 at 21:12

3 Answers3

1

Per the comment on the question, it's somewhat impractical to try to enumerate all of the possible 100-character strings.

I would suggest the alternate strategy of generating random strings of the given length, rather than enumerating in a structured manner. Something like:

count = 0
for i from 0 to simulation_length:
    random_string = ''
    for j from 0 to string_length:
        random_string += random_char()
    // containsWord(string) checks if the random string contains a word
    // this is tricky in and of itself
    if (containsWord(random_string)) count++
...

The random sampling will give you a representation of the behavior across the whole space, as long as simulation_length is sufficient.

mfsiega
  • 2,852
  • 19
  • 22
  • 1
    You could do this much more directly by taking the total number of words for each length `n` and dividing by `n!`, which will be the portion of letter strings of length `n` that are words. I think the OP is asking about containing words as a subset, though, which is harder. – Danica Apr 13 '12 at 05:18
  • yeah that was my interpretation too (hence my answer which doesn't make a lot of sense otherwise) but the code doesn't really reflect it properly. being edited... – mfsiega Apr 13 '12 at 05:20
1

26^x where x is the length of a string ... I'm wanting to run this for longer strings, 100+ characters in length

You should forget about it.

Let's put things in perspective. There are 26 letters in english alphabet, so total number of strings with 100 characters in them is ...

3142930641582938830174357788501626427282669988762475256374173175398995908420104023465432599069702289330964075081611719197835869803511992549376

That's decimal number. At the speed of 1 string per millisecond it'll take 9.9*10^130 years to print them all. That's 7.3*10^120 times longer than universe has existed.

Get a word list or load dictionary into memory and use it instead.

SigTerm
  • 26,089
  • 6
  • 66
  • 115
  • I understood that much going into this. I plan to implement a manual check at random with the first two characters. If not possible to start a word, it abandons that path. I probably worded my question wrong as it is more starting at two characters, checking if a word is possible, if yes add another character and repeat until either words are not possible or the string length has been reached. If not possible, move to the next letter at that position. – Rickie Marsh Apr 13 '12 at 06:31
  • A decent amount of searching/processing can be eliminated by setting a few simple rules for the first two characters. If q is first, the second can only be a vowel. The same is mostly true for some other letters. 26^2 possible two letter combos, q for instance only has 5 valid combinations where it is the first letter. While it still won't be fun setting that many rules, it does remove quite a bit of the problem. Also, since I am considering strings with a specific word at a given location, it can be broken in two sections, before and after the word. – Rickie Marsh Apr 16 '12 at 10:28
  • What we now would like to see, is: How many strings of size 50, 51, 52, ... can be build from a dictionary with following wordlens: "2:183, 3:815, 4:3181, 5:6151, 6: 9317, 7: 11962, 8: 11979, 9: 10400, 10: 8065, ..."? Take your values from `for n in {2..20}; do echo -ne "$n\t"; egrep -v ".*'s" /usr/share/dict/american-english | egrep -c "^.{"$n"}$"; done` – user unknown May 09 '12 at 19:54
0

It would depend on your definition of 'word'. If 'a' is a word it is very easy to get a lower limit on probability of getting a word in a 100 character sequence (roughly 1- 1/e^4). Similarly you can consider 2 letter words and 3 letter words and refine the probability. After 4 or 5 letters this probability becomes really accurate because there are few longer words and them randomly occurring is very rare.

ElKamina
  • 7,747
  • 28
  • 43
  • It is to give more than one word in the given string length. If the user puts in 8, it could return "itisadog" or "wesaidno". Looking at it that way, having a dictionary and looking for all words to add up to the given length seems better – Rickie Marsh Apr 13 '12 at 10:23
  • @RickieMarsh: But you don't expect them to make sense? So `nosaidwe` and `nonoweno` would be fits as well? – user unknown May 09 '12 at 16:23