4

I have posted a bit here related to a project I have been trying to work on and I keep hitting design problems and have to design from scratch. So I'm wondering if I can post what I'm trying to do and someone can help me understand how I can get the result I want.

BackGround:

I'm new to programming and trying to learn. So I took a project that interested me which involves basically taking list and breaking down each number using only numbers from the list. I know I could easily brute force this(which I did) but I wanted to also learn Hbase, Hadoop, and parallel processing so I wanted do it in a way that I could break the process across various machines. I thought the way to do this was using dynamic programming and recursions to create a table of possibilities that could be broken down even further.

Example:

If I submit the list: [1,2, 4] I should get {1: [[1]], 2: [[1, 1]], 4: [[2, 2]]}. What its basically saying is 2+2=4, 1+1=2, and 1=1..so trying to see all the ways to make 4, I can just look up this list(which will be in a database) and see 2+2=4, then break down 2..and so on.. I have the lookup job working but the breakdown I am having problems with. Using brute force will not let me use large numbers(say a million, with a thousand numbers in the list) in a way that I can use hadoop or some other tool to scale it. Here are some more examples of possible results:

[1,2, 3, 4] = {1: [[1]], 2: [[1, 1]], 3: [[1, 2]], 4: [[1, 3], [2, 2]]}
[1,2, 3, 4, 6, 10] = {1: [[1]], 2: [[1, 1]], 3: [[1, 2]], 4: [[1, 3], [2, 2]], 6: [[2, 4], [3, 3]], 10: [[4, 6], [2, 2, 2, 2, 2]]}
[10, 30, 50] = 50: [[10, 10, 30]], 30: [[10, 10, 10]]}

The logic with this approach is that it will not take time to comput the next possible data in the list, so if I send a list with a million numbers it could do it quickly and even scale to a hadoop cluster.

The code I created to get this to work is here but that question was on how to correct the design problem. I got advice that this is a partition problem and looked around and found much simpler versions of what I was trying to do ( activestate ), but its not exactly what I am trying to do because it breaks down a number and does not use a specific data set to do it.

Question:

So hopefully I clearly described what I am trying to do. What do I need to read/study/learn to create a table of a partition of an list in python using dynamic programming so it can scale? Its just a hobby and not time sensitive but I feel I have been working on this for over 3 months and each time I hit design problems that cause me to have to start from scratch. How can I build this correctly(to see my incorrect way see my link above)? I have googled and found solutions to the knapsack problem and partition problems but they seem to be more for school work and not really built to scale with large datasets.

Hopefully someone can give me insight but regardless thank you for reading this.

Community
  • 1
  • 1
Lostsoul
  • 25,013
  • 48
  • 144
  • 239

1 Answers1

3

You have to be aware, that DP problems are not per se optimal for independent and distributed computations.

When you consider the classical forms of DP algorithms, you have a matrix/table/array and you successively compute new values in a certain order. Each computation of a value requires other values that have to be created before. Hence, you lose data independence and can maximally compute a certain number of array fields at the same time, depending on the specific DP algorithms. For example, many DP algorithms can process a whole table column in parallel, as each field relies on fields from a previous column. But that's already the limit due to the data dependency of all remaining fields after that column.

That being said, calculating the sum possibilities of the various numbers available in your list is NOT a DP problem. You do not solve any sub-problems at all, but simply collect all possible sums (if they happen to match one of your list entries).

Hence, I suggest the following sightly different approach:

  • Compute a new list with all possible summations. This is data independent and can be parallelized, but you need some upper bound for termination. Example: [1,2,4] becomes [ [1],[2],[4],[1,1],[1,2],[1,4],...]. You don't have to explicitly construct this list, but just pass each such combination on to the next step.
  • Evaluate each computation, i.e. create the sum and check whether it matches a value from your original list. Again, you are data independent and can perform all these calculations independently.
  • Combine the positive results into the final datastructure.

So to sum it up and answer your questions:

  • Rethink, whether you want to regard this problem as DP at all.
  • You may want to read up on data-parallelism. This is particularly relevant when solving such problems with GPUs, so the corresponding literature on CUDA/OpenCL may be useful, too.
Frank
  • 10,461
  • 2
  • 31
  • 46
  • Thank you so much for taking time to answer this question, Frank. I thought dynamic programming basically helped me generated the pre-computed table based, but I thought about it and had the idea that maybe I don't need to give the dynamic function the entire list, maybe I could break down the list so its processed somewhat independently. For example, 4 can break down into [2,2] and 2 can be [1,1] but I don't need to do this on the same cpu since they seem independent. Also to save cpu time, I didn't compute the entire list but I thought only to the next variable. – Lostsoul Oct 06 '11 at 16:37
  • I don't fully understand your solution though. I have seen others(when speaking of DP) show me a simlair table as well, but what does [1,4] mean? 1 can produce 4? If so how would it solve a number of 5 using a list of [1,2,4]..correct answer should be [4 + 1] but I'm not sure how to generate a list to get that result. – Lostsoul Oct 06 '11 at 16:40
  • In this approach [1,4] is part of your solution in that it is read as 1+4 makes up 5. Note that the first step only creates different possible sums, but doesn't yet care about the value of this sum. – Frank Oct 07 '11 at 05:08
  • oh okay, so am I correct in understanding I create a column with 5 and then a value of [1,4] and so on..so I basically just do a lookup for 5 on my table and then get that value? – Lostsoul Oct 07 '11 at 05:28
  • Lostsoul, don't forget to accept the best answer. I'm pretty sure @Frank's answer is going to be the best one... – steveha Oct 07 '11 at 09:18
  • @steveha I just accepted it. His answer has been super helpful. – Lostsoul Oct 08 '11 at 04:06