random sampling of columns based on column group

Question

I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table

I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:

dframe   <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters


           A           A          A           A           A          A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069  0.1165187
2 -1.5891905 -0.44468389 -0.1186977  0.02270782 -0.64950716 -0.6844163
          A         A          A          A         B         B          B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272  0.8458673
2 -1.644389 0.6360258  0.5612634  0.3559574 1.9658743  1.858222 -1.4502839
           B          B          B         B          B           B          B
1  0.3167216 -0.2919079  0.5146733 0.6628149  0.5481958 -0.01721261 -0.5986918
2 -0.8104386  1.2335948 -0.6837159 0.4735597 -0.4686109  0.02647807  0.6389771
           B          B           B          B          C           C
1 -1.2980799  0.3834073 -0.04559749  0.8715914  1.1619585 -1.26236232
2 -0.3551722 -0.6587208  0.44822253 -0.1943887 -0.4958392  0.09581703
           C          C          C         C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2  0.1680119 -0.5990310  0.9779425 1.0819789

What I want to do is take a random subset of the columns, drawing the same number of columns (a specific sample size) from each group. If the chosen sample size is larger than the number of columns in a group, all columns of that group should be kept.

I have tried an updated version of the method mentioned in this question:

sample rows of subgroups from dataframe with dplyr

but I'm not able to map the column names to the by argument.

Can someone help me with this?

Not clear to me. You want to take a subset but the number of columns per group remains the same?? Do you mean you just want to order the columns randomly? Please clarify — talat, Jun 14 '17 at 11:53
@docendodiscimus the number of columns should ONLY remain the same if the random sample size is larger than the actual number of columns per group. E.g., in the example dataframe, let's assume the sample size is 7: the resulting data.table should include 7 random columns belonging to A, 7 random columns belonging to B and ALL columns belonging to C (because C has only 6 columns, which is fewer than the chosen sample size) – ifreak, Jun 14 '17 at 12:02

talat · Accepted Answer · 2017-06-14T12:38:46.287

4

Here's another approach, IIUC:

idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))

dframe[, keep]

Explanation:

The first step splits the column indices according to the column names:

idx
# $A
# [1]  1  2  3  4  5  6  7  8  9 10
# 
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
# 
# $C
# [1] 25 26 27 28 29 30

In the next step we use

pmin(7, lengths(idx))
#[1] 7 7 6

to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
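The steps above can be put together as a self-contained sketch. The seed, the name `n_per_group` and the helper `safe_sample` are assumptions added for illustration and are not part of the answer. The helper is there because `sample(x, n)` treats a single number `x` as `1:x`, which would pick wrong columns for a group that contains only one column.

```r
set.seed(42)  # assumption: fixed seed only so the sketch is reproducible

# Same layout as the question: 30 columns in groups A (10), B (14), C (6)
dframe <- as.data.frame(matrix(rnorm(60), ncol = 30))
colnames(dframe) <- rep(c("A", "B", "C"), times = c(10, 14, 6))

n_per_group <- 7  # hypothetical name for the requested sample size

# Step 1: column positions, grouped by column name
idx <- split(seq_along(dframe), names(dframe))

# Step 2: per-group sample size, capped at the group size
sizes <- pmin(n_per_group, lengths(idx))

# Hypothetical helper: index into x explicitly so that a one-element x
# is not expanded to 1:x the way sample(x, n) would do
safe_sample <- function(x, n) x[sample.int(length(x), n)]

# Step 3: draw positions group by group, then flatten to one vector
keep <- unlist(Map(safe_sample, idx, sizes), use.names = FALSE)

result <- dframe[, keep, drop = FALSE]

# Group sizes of the selection: A = 7, B = 7, C = 6
table(names(dframe)[keep])
```

Counting via `names(dframe)[keep]` rather than `names(result)` reads the group labels from the original data frame, so the check does not depend on how the subset's column names are de-duplicated.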

edited Jun 14 '17 at 12:38

answered Jun 14 '17 at 12:09

talat


Seems to be working nicely. Can you please explain the code to me? There are functions in it that I have never used. – ifreak Jun 14 '17 at 12:32

score 0 · Answer 2 · answered Jun 14 '17 at 11:45


Not sure if you want a solution with dplyr, but here's one with just lapply:

dframe   <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters

# Number of columns to sample per group
nc <- 8


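res <- do.call(cbind,
       lapply(unique(colnames(dframe)),
              function(x){
                         # positions of the columns belonging to group x
                         idx <- which(colnames(dframe) == x)
                         # keep the whole group if it has nc or fewer columns, otherwise sample nc of them
                         if (length(idx) > nc) idx <- sample(idx, nc, replace = FALSE)
                         # drop = FALSE keeps a one-column group as a data frame, so its name survives cbind
                         dframe[, idx, drop = FALSE]
                         }
))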

It might look complicated, but it really just takes all columns of a group if the group has nc or fewer columns, and samples nc random columns if it has more than nc.

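Subsetting appends a numeric suffix to repeated column names (A, A.1, A.2, ...). To restore your original column-name scheme, strip that suffix with sub. The dot must be escaped and the pattern anchored at the end so that multi-digit suffixes such as ".10" are removed completely:

colnames(res) <- sub('\\.[[:digit:]]+$','',colnames(res))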
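Parsing the group label back out of a de-duplicated name is fragile when the original names can themselves contain dots and digits, which the comments below say is possible. One alternative, shown here as a minimal sketch with made-up example names, is to collect the sampled column positions in a vector (called `keep` here, an assumption) and copy the original names back by position.

```r
# Hypothetical example data whose names already contain dots, digits and spaces
dframe <- as.data.frame(matrix(seq_len(12), ncol = 6))
colnames(dframe) <- c("x.1", "x.1", "x.1", "y 2", "y 2", "z")

nc <- 2  # number of columns to sample per group, as in the answer

# Collect sampled positions per group instead of cbind-ing data frames
keep <- unlist(lapply(unique(colnames(dframe)), function(g) {
  pos <- which(colnames(dframe) == g)
  # sample by position so a one-column group is never expanded to 1:pos
  if (length(pos) > nc) pos[sample.int(length(pos), nc)] else pos
}))

res <- dframe[, keep, drop = FALSE]

# Subsetting may de-duplicate repeated names; copy the originals back by position
colnames(res) <- colnames(dframe)[keep]
```

Because the names come straight from `colnames(dframe)`, no pattern matching is involved and any naming scheme is preserved.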

answered Jun 14 '17 at 11:45

Val


This seems to work, but some of the colnames in res are random and have nothing to do with the original column names – ifreak Jun 14 '17 at 12:06
what do you mean? I get the colnames A - B - C, with an integer appended indicating the sample number (First one is A, second one A.1, and so forth). With the `gsub` function, you can get back to the original A-B-C. – Val Jun 14 '17 at 12:23
My column names do not only contain A, B, C, .... they can be also more characters included. I'm getting something like this as column names: `c(0.2818491673, 0.6562765283, 0, 0, 0, 5.318117652, 0.6930066962,` – ifreak Jun 14 '17 at 12:26
I'm sorry, I coded this based on the example you have given above. Are your column names following any specific naming pattern? – Val Jun 14 '17 at 12:38
no specific pattern, can be any combination of characters – ifreak Jun 14 '17 at 12:50
