-1

i tried to Generate 7 million random strings that do not overlap.(String length must 2 to 4)

However, my code took 28023 ms to generate 7 million non-overlapping strings.

I don't know how to efficiently generate 7 million strings without overlap.

i tried to use hashmap, list, ... because I need a key-value type

    HashMap<String, Integer> map = new HashMap();
    long start = System.currentTimeMillis();

    for(int i = 0 ; i < 7000000 ; ) {
        int MAX_LENGTH = 4;
        int MIN_LENGTH = 2;

        StringBuffer temp = new StringBuffer();
        Random rnd = new Random();
        int length = rnd.nextInt(MAX_LENGTH - MIN_LENGTH + 1) + MIN_LENGTH; // 문자열 길이 랜덤(2~4자리)

        for (int j = 0; j < length; j++) {
            int rIndex = rnd.nextInt(2);
            switch (rIndex) {
                case 0:
                    // a-z(소문자)
                    temp.append((char) ((int) (rnd.nextInt(26)) + 97));
                    break;
                case 1:
                    // A-Z(대문자)
                    temp.append((char) ((int) (rnd.nextInt(26)) + 65));
                    break;
            }
        }


        String str = temp.toString();

        if(!map.containsKey(str)) {
            map.put(str, rnd.nextInt());
            i++;
        }
    }

    long end = System.currentTimeMillis();

    System.out.println("Setup Performance : " + (end - start));

my code took 28023 ms to generate 7 million non-overlapping strings.

  • 3
    One suggestion is to move `Random rnd = new Random();` out of the loop. This will not solve your problem entirely, though. – Peter O. Nov 09 '19 at 02:10
  • 2
    Less than half a minute? What is the problem? – Ole V.V. Nov 09 '19 at 03:33
  • Also replace `StringBuffer` with `StringBuilder` and move instantiation to before loop (create once) and use `setLength(0)` on each iteration. Eliminates GC significantly. –  Nov 09 '19 at 08:52
  • You are using 52 letters: A-Za-z. Think of your strings as 2, 3 or 4 digit numbers in base 52. Pick a number within the range and express it in base 52. There are plenty of ways to generate non-repeating random numbers from a range, such as format preserving encryption. – rossum Nov 09 '19 at 14:31

2 Answers2

5

So you want to generate 7 million strings, each with 2 to 4 characters, from a set of 52 characters (A-Z and a-z).

Some simple math tells us that there are 2,704 possible 2-character strings, 140,608 3-character strings, and 7,311,616 possible 4-character strings. That's a total of 7,454,928 possible strings.

So create an array that contains all the numbers from 0 to 7,454,927. Shuffle it, and pick the first 7 million from the array.

Of course, you'll have to write code that turns that number into one of your output strings. That's easy enough. 0 is "AA" 51 is "Az", and 52 is "BA". You're basically doing an integer-to-string conversion.

Here's the basic idea. You'll have to convert to Java.

arraySize = 7454928
numbers = new array[arraySize]

// populate array
for i = 0 to arraySize-1
    numbers[i] = i

shuffle(numbers)

for i = 0 to 7000000
    code = convertNumberToString(numbers[i])
    print code


convertNumberToString(n)
    Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    outputString = ""
    while (n > 0 || outputString.length < 2)
      result = n / 52 // length of alphabet
      remainder = n % 52;
      outputString.append(Alphabet[remainder])
      n = result;
    // outputString now contains the characters,
    // but they're reversed. So reverse the string.
    return outputString.reverse()
Jim Mischel
  • 131,090
  • 20
  • 188
  • 351
0

To generate n random strings, we need at least O(n) steps. To check if a new string is already generated or not, you need a good search algorithm. So, the searching may require O(n^2), or O(nlogn), or O(logn), etc... depends on what algorithm you are using. How about removing the searching steps completely by adding a index (1,2,3...7,000,000 for example) prefixes for the random generated strings? If you don't like decimal prefix, you can use hexadecimal prefix, etc... Hope this can help to improve it a bit.

Added C# sample program.

using System;
using System.Collections.Generic;

namespace Test
{
    class Program
    {
        public static List<char> GeneratePrintableCharArray(int numFirstChars = 256)
        //  Go thru numFirstChars first Unicode characters and extract
        //  non-control and non-white space characters into a list.
        //  From the default first 256 unicode chars, we can have 189 characters.
        {
            List<Char> printableChars = new List<char>();
            for (int i = char.MinValue; i < numFirstChars && i <= char.MaxValue; i++)
            {
                char c = Convert.ToChar(i);
                if (!char.IsControl(c) && !char.IsWhiteSpace(c))
                {
                    printableChars.Add(c);
                }
            }
            return printableChars;
        }

        static void Main(string[] args)
        {
            //  If we want to use 2 "digits" to store the index upto 7 million,
            //  we need 2645-base number system

            //  extract the first 2645 Uniocde characters and the null character
            List<char> chars = GeneratePrintableCharArray(2713);
            chars.Insert(0, '\0');

            List<string> idxList = new List<string>();
            List<string> randomList = new List<string>();

            int maxNumStrings = 7000000;
            bool stop = false;

            //  1. Generate "2-digit" index string list and up-to 2 character string list
            for (int i = 0; i < chars.Count && !stop; i++)
            {
                for (int j = 0; j < chars.Count && !stop; j++)
                {
                    string s = "";


                    if (i > 0 && j > 0)
                    {
                        s += new string(chars[i], 1);
                        s += new string(chars[j], 1);

                        idxList.Add(s);
                    }
                    else if (i > 0)
                    {
                        s += new string(chars[i], 1);
                    }
                    else if (j > 0)
                    {
                        s += new string(chars[j], 1);
                    }

                    randomList.Add(s);

                    if (idxList.Count == maxNumStrings)
                    {
                        stop = true;
                    }
                }
            }

            //  2. Append random suffix for the index
            Random rnd = new Random();
            for (int i = 0; i < idxList.Count; i++)
            {
                idxList[i] += randomList[rnd.Next(0, randomList.Count - 1)];
            }
            //  Here, we have 7 mil random strings.

            //  3. Just for testing - to make sure we don't have multiple items with the same string
            HashSet<string> set = new HashSet<string>(); 
            for (int i = 0; i < idxList.Count; i++)
            {
                if (set.Contains(idxList[i]))
                {
                    Console.WriteLine("There is a bug. Duplicate string found: " + idxList[i]);
                }
                else
                {
                    set.Add(idxList[i]);
                }
            }

        }
    }
}
Tuan Le PN
  • 364
  • 1
  • 12