Given set of column values, create data.frame with known number of rows

Question

I'm trying to make datasets of a fixed number of rows to make test datasets - however I'm writing to a destination that requires known keys for each column. For my example, assume that these keys are lowercase letters, upper case letters and numbers respectively.

I need to make a function which, provided only the required number of rows, combines keys such that the number of combinations is equal the required number. Naturally there will be some impossible cases such as prime numbers than the largest key and values larger than the product of the number of keys.

A sample output dataset of 10 rows could look like the following:

data.frame(col1 = rep("a", 10),
           col2 = rep(LETTERS[1:5], 2),
           col3 = rep(1:2, 5))

   col1 col2 col3
1     a    A    1
2     a    B    2
3     a    C    1
4     a    D    2
5     a    E    1
6     a    A    2
7     a    B    1
8     a    C    2
9     a    D    1
10    a    E    2

Note here that I had to manually specify the keys to get the desired number of rows. How can I arrange things so that R can do this for me?

Things I've already considered

optim - The equation I'm trying to solve is effectively x * y * z = n where all of them must be integers. optim doesn't seem to support that constraint
expand.grid and then subset - almost 500 million combinations, eats up all my memory - not an option.
lpSolve - Has the integer option, but only seems to support linear equations. Could use logs to make it linear, but then I can't use the integer option.
factorize from gmp to get factors - Thought about this, but I can't think of a way to distribute the prime factors back into the keys. EDIT: Maybe a bin packing problem?

In my case it's usually four but it varies. I always know in advance how many there are and what the keys are. Solving the example in the question is enough for me to work out how to apply it more broadly. — sebastian-c, Jun 09 '17 at 09:46
I am still fighting to understand what do you mean by *combines keys such that the number of combinations is equal the required number*. I haven't got the whole picture... — amonk, Jun 09 '17 at 09:50
There seems also to be a grammatical mistake in your last sentence, and it's not clear if you're giving up a sample input or output — moodymudskipper, Jun 09 '17 at 09:57
Consider a smaller example. Assume two columns where the set of keys for the first is lower case letters and the second upper case letters. To get 10 rows where all combinations are unique, any of the following are valid: (10 from col1, 1 from col2), (5, 2), (2, 5), (1, 10). I don't need all of these, just one valid combination. — sebastian-c, Jun 09 '17 at 09:57
I'm giving a sample output that I'd like where the input is just the desired number of rows. I've now specified that. — sebastian-c, Jun 09 '17 at 09:58
I've manually chosen (1, 5, 2) (1 lower case, 5 upper case, 2 numbers) as the keys. The goal is to have an algorithm make this choice for me to get 10 rows (or however many). — sebastian-c, Jun 09 '17 at 10:23
Did you look at the [wakefield-package](https://cran.r-project.org/web/packages/wakefield/index.html) ([GH-repo](https://github.com/trinker/wakefield))? It is designed for generating random datasets. — Jaap, Jun 12 '17 at 06:01
@Jaap, it doesn't seem to me like it produces unique records, unfortunately. — sebastian-c, Jun 13 '17 at 08:44

apitsch · Accepted Answer · 2017-06-10T09:04:59.643

1

For integer optimisation on a low level scale you can use a grid search. Other possibilities are described here.

This should work for your example.

N <- 10
fr <- function(x) { 
  x1 <- x[1]
  x2 <- x[2]
  x3 <- x[3]
  (x1 * x2 * x3 - N)^2
}
library(NMOF)
gridSearch(fr, list(seq(0,5), seq(0,5), seq(0,5)))$minlevels

edited Jun 10 '17 at 09:04

answered Jun 09 '17 at 12:20

apitsch

1,532
14
31

This is exactly the kind of result I'm after, but you're right that it only works at a low level scale. – sebastian-c Jun 09 '17 at 12:36
To reduce the computational time for higher `N` you can of course reduce the number of values the different variables can take manually. Nonetheless, it is a caveat. – apitsch Jun 09 '17 at 13:00

score 0 · Answer 2 · answered Jun 09 '17 at 12:10

0

I am a bit reluctant,but we can work things out:

  a1<-2
  a2<-5

  eval(parse(text=paste0("data.frame(col1 = rep(LETTERS[1],",a1*a2,"),col2 = 
  rep(LETTERS[1:",a2,"],",a1,"),col3 = rep(1:",a1,",",a2,"))")))

    col1 col2 col3
1     A    A    1
2     A    B    2
3     A    C    1
4     A    D    2
5     A    E    1
6     A    A    2
7     A    B    1
8     A    C    2
9     A    D    1
10    A    E    2

Is this something similar to what you are asking?

answered Jun 09 '17 at 12:10

amonk

1,769
2
18
27

No, I'm afraid. It's the wrong approach. Here you're specifying `a1` and `a2`. I would like R to automatically work out those values when I provide the number of rows I would like. There would be multiple solutions, but I need only one. – sebastian-c Jun 09 '17 at 12:26

Given set of column values, create data.frame with known number of rows

2 Answers2