
The code below takes 2 seconds to finish. It looks clean but is very inefficient.

I am trying to pre-generate the ways you can build up to a total of max_units in increments of 2.

I'd then filter the resulting table to the rows where the secondary_categories meet certain criteria:

  • 'A' is >10% of the total and 'B' is <=50% of the total.

Do you see a better way to get the combinations in increments of 2 that meet criteria like the above?

import itertools
import pandas as pd

primary_types= ['I','II']                
secondary_categories= ['A','B'] 

unitcategories= len(primary_types)*len(secondary_categories) #up to 8

min_units= 108; max_units= 110 #between 20 and 400
max_of_one_type= max_units


args =[[i for i in range(2,max_of_one_type, 2)] for x in range(unitcategories)]

lista= list(itertools.product(*args))

filt= [True if max_units>=l>=min_units else False for l in list(map(sum, lista))]

lista= list(itertools.compress(lista, filt))

df=pd.DataFrame(lista, columns= pd.MultiIndex.from_product([primary_types, secondary_categories], names=['', '']))

df['Total']=df.sum(axis=1)

df
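For the criteria above, the filtering step on the generated table could look like the following sketch. The rows here are hypothetical stand-ins for the generated combinations, and the `a_share`/`b_share` names are made up for illustration:

```python
import pandas as pd

primary_types = ["I", "II"]
secondary_categories = ["A", "B"]
cols = pd.MultiIndex.from_product([primary_types, secondary_categories], names=["", ""])

# Three hypothetical rows with the same column layout as the generated table
df = pd.DataFrame([[10, 40, 20, 38],
                   [2, 60, 4, 42],
                   [30, 20, 28, 30]], columns=cols)

total = df.sum(axis=1)
df["Total"] = total

# Share of each secondary category, summed across both primary types
a_share = df.loc[:, (slice(None), "A")].sum(axis=1) / total
b_share = df.loc[:, (slice(None), "B")].sum(axis=1) / total

filtered = df[(a_share > 0.10) & (b_share <= 0.50)]
```

The `(slice(None), "A")` selector picks every level-0 label ('I' and 'II') combined with the level-1 label 'A', so the shares are computed across both primary types at once.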

Extending any of the following makes it take significantly longer or run out of memory: primary_types, secondary_categories, min_units, max_units.

Thank you

Kdog
  • I would *start* by not needlessly calling `list` on every iterator in your code, thereby defeating the *purpose* of using iterators... – juanpa.arrivillaga Jun 17 '20 at 20:40
  • It seems to me you can leave out the intermediate `filt` list, dispense with `itertools.compress`, and just filter in a list comprehension... – juanpa.arrivillaga Jun 17 '20 at 20:41
  • Note also, args can just be `[range(2, max_of_one_type, 2) for x in range(unitcategories)]`... why do you needlessly make lists out of everything? Anything of the form `[i for i in whatever]` can just be `list(whatever)`, but in this case **there is no point in converting your `range` object into a list to begin with...** – juanpa.arrivillaga Jun 17 '20 at 20:42
  • Thank you. Do you have any insight about the problem of generating too much data for the available memory? – Kdog Jun 17 '20 at 20:54
  • Do you have to work in increments of 2? Everything throughout the entire algorithm is an even number, so you can get an easy win by dividing everything by 2 at the start. Then, when it's all done, multiply the result you want by 2. – Peaceful James Jun 17 '20 at 21:10
  • It's actually a really cool problem. I like it a lot. – Peaceful James Jun 17 '20 at 21:31

1 Answer


OK so I'm posting this just FYI but I don't think it's an ideal solution. I believe there exists a far more elegant solution, and I bet it involves numpy. However, this should at least be faster than the OP's version:

import itertools
import pandas as pd


primary_types = ["I", "II"]
secondary_categories = ["A", "B"]

unitcategories = len(primary_types) * len(secondary_categories)  # up to 8

min_units = 54
max_units = 55  # between 10 and 200
max_of_one_type = max_units

args = [range(1, max_of_one_type) for x in range(unitcategories)]

lista = [x for x in itertools.product(*args) if max_units >= sum(x) >= min_units]

df = pd.DataFrame(
    lista,
    columns=pd.MultiIndex.from_product(
        [primary_types, secondary_categories], names=["", ""]
    ),
)

df["Total"] = df.sum(axis=1)

df = df * 2  # multiply by 2 to get the result you want
  1. I divided everything by 2 at the start and multiplied the result at the end by 2.
  2. I removed all unnecessary uses of `list`.
  3. I removed `itertools.compress` and `filt` and instead just put an `if` in the list comprehension (where `lista` is declared and assigned).
Peaceful James
  • Thank you. Your solution cut the processing time in half. But if I extend secondary_categories by adding an element "C", like ["A", "B", "C"], the main problem I'm having becomes much more apparent – the code doesn't finish – Kdog Jun 17 '20 at 21:32
  • Yes, I imagine that would be the case. I would need more time to work on this problem in an abstract way. There is a pattern to the tuples whose sums are within the specified range, and I'm certain there exists a numpy function that would do the job perfectly. I cannot look at this any more now, but I will revisit it in the near future. Good luck, fellow coder! :) – Peaceful James Jun 17 '20 at 21:34
  • @Kdog yes, because your algorithm is polynomial time. – juanpa.arrivillaga Jun 17 '20 at 21:36
  • @juanpa.arrivillaga Could have been worse, right? Could have been exponential. I have a feeling I'll end up converting the code to loops and distributing over multiple cores to generate the options that add up to the total in increments of 2. Unfortunately a formulaic solution would give continuous values. – Kdog Jun 17 '20 at 21:54
  • @Kdog Parallelizing isn't really going to help the polynomial time... But it may be just enough for what you are trying to do – juanpa.arrivillaga Jun 17 '20 at 22:09
  • @PeacefulJames Just as an FYI, this here looks really interesting https://stackoverflow.com/questions/57166794/tree-to-find-all-cartesian-product-of-lists-whose-product-is-greater-than-a-thre – Kdog Jun 17 '20 at 22:58