2

What I want to achieve:

I have a function where I want to loop through all possible combinations of printable ascii-characters, starting with a single character, then two characters, then three etc.

The part that makes this difficult for me is that I want this to work for as many characters as I can (leave it overnight).

For the record: I know that abc really is 97 98 99, so a numeric representation is fine if that's easier.


This works for few characters:

I could create a list of all possible combinations for n characters, and just loop through it, but that would require a huge amount of memory already when n = 4. This approach is literally impossible for n > 5 (at least on a normal desktop computer).

In the script below, all I do is increment a counter for each combination. My real function does more advanced stuff.

If I had unlimited memory I could do (thanks to Luis Mendo):

counter = 0;
some_function = @(x) 1;
number_of_characters = 1;
max_time = 60;
max_number_of_characters = 8;
tic;
while toc < max_time && number_of_characters < max_number_of_characters
    number_of_characters = number_of_characters + 1;
    vectors = [repmat({' ':'~'}, 1, number_of_characters)];
    n = numel(vectors);
    combs = cell(1,n);
    [combs{end:-1:1}] = ndgrid(vectors{end:-1:1});
    combs = cat(n+1, combs{:});
    combs = reshape(combs, [], n);
    for ii = 1:size(combs, 1)
        counter = counter + some_function(combs(ii, :));
    end
end

Now, I want to loop through as many combinations as possible in a certain amount of time, 5 seconds, 10 seconds, 2 minutes, 30 minutes, so I'm hoping to create a function that's only limited by the available time, and uses only some reasonable amount of memory.


Attempts I've made (and failed at) for more characters:

I've considered pre-computing the combinations for two or three letters using one of the approaches above, and use a loop only for the last characters. This would not require much memory, since it's only one (relatively small) array, plus one or more additional characters that gets looped through.

I manage to scale this up to 4 characters, but beyond that I start getting into trouble.


I've tried to use an iterator that just counts upwards. Every time I hit any(mod(number_of_ascii .^ 1:n, iterator) == 0) I increment the m'th character by one. So, the last character just repeats the cycle !"# ... ~, and every time it hits tilde, the second character increments. Every time the second character hits tilde, the third character increments etc.


Do you have any suggestions for how I can solve this?

Community
  • 1
  • 1
CG.
  • 263
  • 1
  • 8
  • What is the maximum number of characters in your output string? – beaker Jul 29 '16 at 14:55
  • I'm not sure how many combinations my computer can loop through in a few hours, but it should preferably work for that many. I would think it reaches a limit around 6 or 7, but I'm not sure... – CG. Jul 29 '16 at 22:33
  • As Prophecies mentions below, this is counting in the base of the number of tokens, in your case 95. `95^5 = 7.7378e+09`. That's going to take a while. – beaker Jul 29 '16 at 22:39
  • It shouldn't be more than a couple of hours tops? On my slow 4 year old laptop, using Octave I can increment a counter `1e7` times in about 30 seconds. Multiplying that by 7e2 should give roughly 6 hours. I guess I can do this a whole lot faster with Matlab and on a faster computer... But I'll go with whatever is feasible. =) – CG. Jul 29 '16 at 22:51
  • All combinations, so if the set was only `abc`, and I want only two characters, then I want: `aa, ab, ac, ba, bb, bc, ca, cb, cc`. The sorting doesn't matter as long as all combinations are there. – CG. Jul 29 '16 at 23:12
  • Here's an experiment for you: `tic; for d = 0:36^5-1 b=dec2base(d, 36); end; toc`. This will give you the strings of length 5 using the characters `[0:9, A-Z]`. The total number of strings produced will be 36^5 or about 6e7. This will be a much better estimate of how long your task will take than an empty loop. – beaker Jul 29 '16 at 23:35
  • Hmm are you sure you do have to loop over all these combinations? If you'd explain a bit what is the final goal it might be possible to find some tricks to reduce computation time – BillBokeey Aug 02 '16 at 08:53

2 Answers2

0

It looks like you're basically trying to count in base-26 (or base 52 if you need CAPS). Each number in that base will account for a specific string of character. For example,

0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,10,11,12,...

Here, cap A through P are just symbols that are used to represent number symbols for base-26 system. The above simply represent this string of characters.

a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,ba,bb,bc,...

Then, you can simply do this:

symbols = ['0','1','2','3','4','5','6','7','8','9','A','B','C','D','E',...
        'F','G','H','I','J','K','L','M','N','O','P']
characters = ['a','b','c','d','e','f','g','h','i','j','k','l',...
        'm','n','o','p','q','r','s','t','u','v','w','x','y','z']
count=0;
  while(true)
     str_base26 = dec2base(count,26)
     actual_str = % char-by-char-lookup-of-str26 to chracter string
     count=count+1;
  end

Of course, it does not represent characters that begin with trailing 0's. But that should be pretty simple.

Prophecies
  • 723
  • 1
  • 7
  • 19
  • I don't think this solves it. I believe it would be base-94, since there are 94 distinct ASCII-characters. MATLAB can only handle up to base-35 or base-36 – Stewie Griffin Jul 29 '16 at 22:21
  • @StewieGriffin Base-95? Octave's `dec2base` could handle it, but not MATLAB's. You could always build your own base converter. – beaker Jul 29 '16 at 22:35
  • Actually, Octave couldn't handle it. It won't allow whitespace as a symbol. – beaker Jul 29 '16 at 22:45
  • @beaker, really? Doesn't Octave handle whitespace? How ridiculous is that? – CG. Jul 29 '16 at 23:01
  • @user4694 It doesn't allow whitespace as a symbol for a numeric base. It handles whitespace just fine in strings. :) – beaker Jul 29 '16 at 23:03
  • @beaker, OK, that makes more sense... :) – CG. Jul 29 '16 at 23:04
0

You were not far with your idea of just getting an iterator that just counts upward.

What you need with this idea is a map from the integers to ASCII characters. As StewieGriffin suggested, you'd just need to work in base 95 (94 characters plus whitespace).

Why whitespace : You need something that will be mapped to 0 and be equivalent to it. Whitespace is the perfect candidate. You'd then just skip the strings containing any whitespace. If you don't do that and start directly at !, you'll not be able to represent strings like !! or !ab.


First let's define a function that will map (1:1) integers to string :

function [outstring,toskip]=dec2ASCII(m)


    out=[];

    while m~=0

        out=[mod(m,95) out];

        m=(m-out(1))/95;

    end

    if any(out==0)

        toskip=1;

    else

        toskip=0;

    end

    outstring=char(out+32);

end

And then in your main script :

counter=1;
some_function = @(x) 1;
max_time = 60;
max_number_of_characters = 8;
currString='';


tic;


while numel(currString)<=max_number_of_characters&&toc<max_time

    [currString,toskip]=dec2ASCII(counter);

    if ~toskip

        some_function(currString);

    end

    counter=counter+1;

end

Some random outputs of the dec2ASCII function :

dec2ASCII(47)

ans =

O

dec2ASCII(145273)

ans =

0)2

In terms of performance I can't really elaborate as I don't know what you want to do with your some_function. The only thing I can say is that the running time of dec2ASCII is around 2*10^(-5) s


Side note : iterating like this will be very limited in terms of speed. With the function some_function doing nothing, you'd just be able to cycle through 4 characters in around 40 minutes, and 5 characters would already take up to 64 hours. Maybe you'd want to reduce the amount of stuff you want to pass through the function you iterate on.

This code, though, is easily parallelizable, so if you want to check more combinations, I'd suggest trying to do it in a parallel manner.

BillBokeey
  • 3,168
  • 14
  • 28