1

My Problem

I am trying to generate a small subset of possible combinations from a very large Cartesian Product. My input would be an array of arrays, but the size of each array is dynamic. For now, I am using Python, but I am open to any language that needs to be used.

My Goal

After seeing this question: How to select specific item from cartesian product without calculating every other item, I thought that was an amazing algorithm that could generate a set given an index. However, that only specifically works with 3 arrays. My end goal would be something like this, where the function that determines the set is find_set:

Input:
A = [ A_0, A_1, A_2, ..., A_n ]
B = [ B_0, B_1, B_2, ..., B_n ]
C = [ C_0, C_1, C_2, ..., C_n ]
D = [ D_0, D_1, D_2, ..., D_n ]
...
N = [ N_0, N_1, D_2, ..., N_n ]

List = [ A, B, C, D, ... N]

find_set(List, 0) -> [ A_0, B_0, C_0, ..., N_0 ]
find_set(List, 1) -> [ A_0, B_0, C_0, ..., N_1 ]
...

And so on and so forth, for any given index.

What I've Done So Far

I was using Python 2.7 and itertools.product to generate all of the combinations, but this only makes an iterator. After running into memory consumption problems, I tried something like this:

results = itertools.product(*List)

# generates 100,000 random indices between 0 and the size of the Cartesian Product
desired_indices = random.sample(xrange(0, calculated_size - 1), 100000) 

for item, index in results:
    if index in desired_indices:
          # do things

The thing is, this leads to O(N) operations regardless, and when I have 433,501,216 possible sets, this will take a VERY long time just to find an incredibly small subset. I appreciate all the help and any other resources I can look to for more knowledge on the subject.

Tyler
  • 51
  • 1
  • 7

2 Answers2

1

I don't speak Python, but I wrote an iterator for Scala which needn't store intermediate results, just an index.

Tell me, if you need more information about the syntax.

The basic idea is as following: If you have two Lists, one of 3 and one of two elements, (a,b,c) and (1,2), you can generate (a1, a2, then b1, b2, lastly c1, c2). It' entirely controlled by the length of each list, so it must be possible to calculate the size of the cartesian product in advance (2*3) or the product of lengths in the general case, and by taking the modulo of each length in the right order, for every number in (0..size-1) returning a different set of elements of it.

class CartesianIterator [T] (val ll: Seq[Seq[T]]) extends Iterator [Seq[T]] { // with IndexedSeq [Seq[T]] {
  var current = 0L
  override val size = ll.map (_.size).product

  def get (n: Long): List[T] = {

      def get (n: Long, lili: Seq[Seq[T]]): List[T] = lili.length match {
        case 0L => List ()
        case _ => {
          val inner = lili.head
          inner ((n % inner.size).toInt) :: get (n / inner.size, lili.tail)
        }
      }

      get (n, ll)
  }

  override def hasNext () : Boolean = current != size
  override def next (): Seq[T] = {
    current += 1
    get (current - 1)
  }

    // IndexedSeq: Selects an element by its index in the sequence.
    // val x = CartesianIterator (...)
    // val res = x(123) // = x.apply (123)
    def apply (idx: Long): Seq[T] = {
        current = idx-1L
        next ()
    }
}

def exhaustiveDemo () {
  val ci = new CartesianIterator (List(List ('a', 'b'), List (1, 2, 3, 4), List ("foo", "bar", "baz")))
  for (p <-ci) println (p)
}

def massiveDemo () {
    val r = util.Random
    // 8 bit per value, ...
    val li = (0 to 256).toList
    // 256^8 combinations, 0 to Long.MaxValue
    val ll = List (li, li, li, li, li, li, li, li)
    val ci = new CartesianIterator (ll)
    for (i <- 0 to 9;
        rnd = math.abs(r.nextLong ());
        tuple = ci.get (rnd)
    ) println (tuple.mkString (":") + " at idx: " + rnd)
}

exhaustiveDemo ()
massiveDemo ()

Sample output:

List(a, 1, foo)
List(b, 1, foo)
List(a, 2, foo)
//...
List(b, 3, baz)
List(a, 4, baz)
List(b, 4, baz)

92:167:65:79:242:74:167:67 at idx: 5009630557992325817
252:176:16:94:68:30:43:44 at idx: 3270674835001624787
113:63:109:2:2:184:2:82 at idx: 6072977656466246445
95:68:232:237:171:233:183:114 at idx: 8494823201479688501
181:241:90:213:40:128:134:57 at idx: 4259670111450561340
29:113:165:134:150:89:247:72 at idx: 5402953717167499239
87:30:93:211:245:210:1:83 at idx: 6146770892844404160
25:205:116:230:196:105:62:37 at idx: 2757875964952116944
194:68:71:160:63:57:204:41 at idx: 3094941632910767450
166:133:37:249:17:6:215:92 at idx: 6874662895836376467
user unknown
  • 35,537
  • 11
  • 75
  • 121
  • If I'm reading this right, it looks like you're just generating the Cartesian Product without storing the entire thing, and iterating to the next element as necessary, correct? So if I wanted to access the element at, say, index 54,377 would I be able to directly produce that? Or do I need to iterate through each and every one until I reach that index? If the latter is the case, you'd still be left with O(n) performance – Tyler Mar 20 '18 at 02:53
  • Yes, you just set the index to the desired value and calculate just that element at the position, not the 54.376 before. Or the 54 millions before, billions or whatever you like. MassiveDemo shows just that. If you initialize random with a constant seed, you will always get the same results. Unfortunately, the collection base classes are expected to be of maximum size int so that using a long is already problem. Using BigInt would need a rewrite of the libraries expecting an Int and break the flawless integration with other tools/libs. Maybe more easy in Python. – user unknown Mar 20 '18 at 07:36
  • For each element, the access time is proportional to the number of Lists (or Seqs) in the collection. For Element 54377 you would call `ci.get (54377, ll)` or better, supply a method 'get' which only takes the index `ci.get (54377)` since the List of Lists is known as an entry point to the recursive get method, which expects the list to be a parameter. – user unknown Mar 20 '18 at 07:43
  • Thank you for posting this, your answer did help me do more research as well as guide my own implementation(: – Tyler Mar 20 '18 at 20:44
1

I was actually able to figure it out myself. In case anyone else comes across this, here is an implementation in Python:

import math

class LazyCartesianProduct:
    def __init__(self, sets):
        self.sets = sets
        self.divs = []
        self.mods = []
        self.maxSize = 1
        self.precompute()

    def precompute(self):
        for i in self.sets:
            self.maxSize = self.maxSize * len(i)
        length = len(self.sets)
        factor = 1
        for i in range((length - 1), -1, -1):
            items = len(self.sets[i])
            self.divs.insert(0, factor)
            self.mods.insert(0, items)
            factor = factor * items

    def entryAt(self, n):
        length = len(self.sets)
        if n < 0 or n >= self.maxSize:
            raise IndexError
        combination = []
        for i in range(0, length):
            combination.append(self.sets[i][ int(math.floor(n / self.divs[i])) % self.mods[i]])
        return combination

This implements the algorithm from this website: http://phrogz.net/lazy-cartesian-product

Tyler
  • 51
  • 1
  • 7