0

I'm trying to make datasets of a fixed number of rows to make test datasets - however I'm writing to a destination that requires known keys for each column. For my example, assume that these keys are lowercase letters, upper case letters and numbers respectively.

I need to make a function which, provided only the required number of rows, combines keys such that the number of combinations is equal the required number. Naturally there will be some impossible cases such as prime numbers than the largest key and values larger than the product of the number of keys.

A sample output dataset of 10 rows could look like the following:

data.frame(col1 = rep("a", 10),
           col2 = rep(LETTERS[1:5], 2),
           col3 = rep(1:2, 5))

   col1 col2 col3
1     a    A    1
2     a    B    2
3     a    C    1
4     a    D    2
5     a    E    1
6     a    A    2
7     a    B    1
8     a    C    2
9     a    D    1
10    a    E    2

Note here that I had to manually specify the keys to get the desired number of rows. How can I arrange things so that R can do this for me?

Things I've already considered

  • optim - The equation I'm trying to solve is effectively x * y * z = n where all of them must be integers. optim doesn't seem to support that constraint
  • expand.grid and then subset - almost 500 million combinations, eats up all my memory - not an option.
  • lpSolve - Has the integer option, but only seems to support linear equations. Could use logs to make it linear, but then I can't use the integer option.
  • factorize from gmp to get factors - Thought about this, but I can't think of a way to distribute the prime factors back into the keys. EDIT: Maybe a bin packing problem?
sebastian-c
  • 15,057
  • 3
  • 47
  • 93
  • how many columns? – amonk Jun 09 '17 at 09:39
  • In my case it's usually four but it varies. I always know in advance how many there are and what the keys are. Solving the example in the question is enough for me to work out how to apply it more broadly. – sebastian-c Jun 09 '17 at 09:46
  • I am still fighting to understand what do you mean by *combines keys such that the number of combinations is equal the required number*. I haven't got the whole picture... – amonk Jun 09 '17 at 09:50
  • There seems also to be a grammatical mistake in your last sentence, and it's not clear if you're giving up a sample input or output – moodymudskipper Jun 09 '17 at 09:57
  • 1
    Consider a smaller example. Assume two columns where the set of keys for the first is lower case letters and the second upper case letters. To get 10 rows where all combinations are unique, any of the following are valid: (10 from col1, 1 from col2), (5, 2), (2, 5), (1, 10). I don't need all of these, just one valid combination. – sebastian-c Jun 09 '17 at 09:57
  • 1
    I'm giving a sample output that I'd like where the input is just the desired number of rows. I've now specified that. – sebastian-c Jun 09 '17 at 09:58
  • In your example, how do you *manually specify the keys*? – amonk Jun 09 '17 at 10:10
  • I've manually chosen (1, 5, 2) (1 lower case, 5 upper case, 2 numbers) as the keys. The goal is to have an algorithm make this choice for me to get 10 rows (or however many). – sebastian-c Jun 09 '17 at 10:23
  • Did you look at the [wakefield-package](https://cran.r-project.org/web/packages/wakefield/index.html) ([GH-repo](https://github.com/trinker/wakefield))? It is designed for generating random datasets. – Jaap Jun 12 '17 at 06:01
  • @Jaap, it doesn't seem to me like it produces unique records, unfortunately. – sebastian-c Jun 13 '17 at 08:44

2 Answers2

1

For integer optimisation on a low level scale you can use a grid search. Other possibilities are described here.

This should work for your example.

N <- 10
fr <- function(x) { 
  x1 <- x[1]
  x2 <- x[2]
  x3 <- x[3]
  (x1 * x2 * x3 - N)^2
}
library(NMOF)
gridSearch(fr, list(seq(0,5), seq(0,5), seq(0,5)))$minlevels
apitsch
  • 1,532
  • 14
  • 31
  • This is exactly the kind of result I'm after, but you're right that it only works at a low level scale. – sebastian-c Jun 09 '17 at 12:36
  • To reduce the computational time for higher `N` you can of course reduce the number of values the different variables can take manually. Nonetheless, it is a caveat. – apitsch Jun 09 '17 at 13:00
0

I am a bit reluctant,but we can work things out:

  a1<-2
  a2<-5

  eval(parse(text=paste0("data.frame(col1 = rep(LETTERS[1],",a1*a2,"),col2 = 
  rep(LETTERS[1:",a2,"],",a1,"),col3 = rep(1:",a1,",",a2,"))")))

    col1 col2 col3
1     A    A    1
2     A    B    2
3     A    C    1
4     A    D    2
5     A    E    1
6     A    A    2
7     A    B    1
8     A    C    2
9     A    D    1
10    A    E    2

Is this something similar to what you are asking?

amonk
  • 1,769
  • 2
  • 18
  • 27
  • No, I'm afraid. It's the wrong approach. Here you're specifying `a1` and `a2`. I would like R to automatically work out those values when I provide the number of rows I would like. There would be multiple solutions, but I need only one. – sebastian-c Jun 09 '17 at 12:26