
I have a simple vector of integers in R. I would like to randomly select n positions in the vector and "merge" them (i.e. sum) in the vector. This process could happen multiple times, i.e. in a vector of 100, 5 merging/summing events could occur, with 2, 3, 2, 4, and 2 vector positions being merged in each event, respectively. For instance:

#An example original vector of length 10:
ex.have<-c(1,1,30,16,2,2,2,1,1,9)

#For simplicity assume some process randomly combines the 
#first two [1,1] and last three [1,1,9] positions in the vector. 

ex.want<-c(2,30,16,2,2,2,11)

#Here, there were two merging events of 2 and 3 vector positions, respectively

#EDIT: the merged positions do not need to be consecutive. 
#They could be randomly selected from any position. 

But in addition I also need to record how many vector positions were "merged," (including the value 1 if the position in the vector was not merged) - terming them indices. Since the first two were merged and the last three were merged in the example above, the indices data would look like:

ex.indices<-c(2,1,1,1,1,1,3)

Finally, I need to put it all in a matrix, so the final data in the example above would be a 2-column matrix with the integers in one column and the indices in another:

ex.final<-matrix(c(2,30,16,2,2,2,11,2,1,1,1,1,1,3),ncol=2,nrow=7)

At the moment I am seeking assistance even on the simplest step: combining positions in the vector. I have tried multiple variations on the sample and split functions, but am hitting a dead end. For instance, sum(sample(ex.have,2)) will sum two randomly selected values (and sum(sample(ex.have,rpois(1,2))) will add some randomness to n), but I am unsure how to leverage this to achieve the desired dataset. An exhaustive search has led to multiple articles on combining vectors, but not positions within a vector, so I apologize if this is a duplicate. Any advice on how to approach any of this would be much appreciated.

jpsmith
  • This could be interesting. A few questions. (1) How do you determine the *number* of elements that are summed? Is that a random number? In other words: What are the rules for merging the first 2 and the last 3 elements? (2) What are the rules for selecting indices of elements that will be merged? Are they (uniform-)randomly (?) chosen? I can think of a few edge cases that may or may not arise. For example, what if the starting position is the last element, and you'd like to sum the next 4 elements (which don't exist). Some details from you will help clarify on how to deal with those cases. – Maurits Evers Dec 04 '19 at 02:19
  • [continued] One more question: What determines the *number* of merges per vector? Is that also a (uniform-)random number? – Maurits Evers Dec 04 '19 at 02:22
  • The summing and tracking seems easy - it's just a grouped sum, you can use your favorite method from the [sum by group FAQ](https://stackoverflow.com/q/1660124/903061). As Maurits says, the interesting (and unclear) part is the random selection of indices. More info is needed there. – Gregor Thomas Dec 04 '19 at 02:23
  • Thank you! First, I realize in my example that the vector positions in the two merges are both consecutive positions - that need not be the case - i.e. the [1] and [4] positions could have been combined instead of the [1],[2] positions in the first merge. For simplicity (and real-world application) the number of elements to be summed should only range from 2-4 uniformly. So for each merge event, any 2-4 elements in the vector could be randomly selected and merged. The number of merges per vector should be a proportion - i.e. 20% of the positions in the vector will be merged. – jpsmith Dec 04 '19 at 02:45
  • In the case of the `1` and `4` positions, how would your output look? Specifically, what would `ex.indices` look like? – Cole Dec 04 '19 at 04:13
  • Hi - apologies for the delay (my toddler got sick!) - the indices would reflect where the summation is in the new position. For instance, if `1` and `4` were merged at the original `4` position (along with the last 3) and the resultant vector was c(1,30,17,2,2,2,11), the indices would be c(1,1,2,1,1,1,3). But if it was at the original `1` position, c(17,1,30,2,2,2,11), the indices would be c(2,1,1,1,1,1,3). The position where the summation occurs is not important; it is only important to map the indices to the merged positions. – jpsmith Dec 04 '19 at 12:52

2 Answers


I suppose you could write a function like the following:

fun <- function(vec = have, events = merge_events, include_orig = TRUE) {
  if (sum(events) > length(vec)) stop("Too many events to merge")

  # Create "groups" for the events
  merge_events_seq <- rep(seq_along(events), events) 

  # Create "groups" for the rest of the data
  remainder <- sequence((length(vec) - sum(events))) + length(events)

  # Combine both groups and shuffle them so that the 
  # positions being combined are not necessarily consecutive
  inds <- sample(c(merge_events_seq, remainder))

  # Aggregate using `data.table`
  temp <- data.table(values = vec, groups = inds)[
    , list(count = length(values), 
           total = sum(values),
           pos = toString(.I),
           original = toString(values)), groups][, groups := NULL]

  # Drop the other columns if required. Return the output.
  if (isTRUE(include_orig)) temp[] else temp[, c("original", "pos") := NULL][]
}

The function returns four columns:

  1. The count of values that were included in a particular sum (your ex.indices).
  2. The total after summing relevant values (your ex.want).
  3. The positions of the original values from the input vector.
  4. The original values themselves, in case you want to verify it later.

The last two columns can be dropped from the result by setting include_orig = FALSE. The function will also produce an error if the number of elements you're trying to merge exceeds the length of the input (ex.have) vector.

Here's some sample data:

library(data.table)
set.seed(1) ## So you can recreate these examples with the same results
have <- sample(20, 10, TRUE)
have
##  [1]  4  7  1  2 11 14 18 19  1 10

merge_events <- c(2, 3)

fun(have, merge_events)
##    count total      pos   original
## 1:     1     4        1          4
## 2:     1     7        2          7
## 3:     2     2     3, 9       1, 1
## 4:     1     2        4          2
## 5:     3    40 5, 8, 10 11, 19, 10
## 6:     1    14        6         14
## 7:     1    18        7         18

fun(events = c(3, 4))
##    count total        pos     original
## 1:     4    39 1, 4, 6, 8 4, 2, 14, 19
## 2:     3    36    2, 5, 7    7, 11, 18
## 3:     1     1          3            1
## 4:     1     1          9            1
## 5:     1    10         10           10

fun(events = c(6, 4, 3))
## Error: Too many events to merge

input <- sample(30, 20, TRUE)
input
##  [1]  6 10 10  6 15 20 28 20 26 12 25 23  6 25  8 12 25 23 24  6

fun(input, events = c(4, 7, 2, 3))
##    count total                    pos                original
## 1:     7    92 1, 3, 4, 5, 11, 19, 20 6, 10, 6, 15, 25, 24, 6
## 2:     1    10                      2                      10
## 3:     3    71               6, 9, 14              20, 26, 25
## 4:     4    69          7, 12, 13, 16           28, 23, 6, 12
## 5:     2    45                  8, 17                  20, 25
## 6:     1    12                     10                      12
## 7:     1     8                     15                       8
## 8:     1    23                     18                      23

# Verification
input[c(1, 3, 4, 5, 11, 19, 20)]
## [1]  6 10  6 15 25 24  6

sum(.Last.value)
## [1] 92
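
The grouping step on its own can also be sketched in base R, which may help if only the two-column matrix from the question is needed. This is a minimal sketch and not part of the function above: `grp` is a hypothetical, hand-written label vector that reproduces the question's example (first two and last three positions merged), and `rowsum()` computes the per-group sums.

```r
# Example vector from the question
ex.have <- c(1, 1, 30, 16, 2, 2, 2, 1, 1, 9)

# Hypothetical group labels: positions sharing a label are merged
grp <- c(1, 1, 2, 3, 4, 5, 6, 7, 7, 7)

# Summing a column of ones per group counts the merged positions
ex.final <- rowsum(cbind(total = ex.have, count = 1), group = grp)
ex.final
```

Replacing `grp` with a shuffled label vector, such as `inds` inside `fun`, would give non-consecutive merges; the rows then follow the sorted labels rather than the original positions.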
A5C1D2H2I1M1N2O1R2T1

Here is a function I designed to perform the task you described.

The vec_merge function takes the following arguments:

x: an integer vector.

event_perc: The proportion of positions that give rise to a merge event. This is a number greater than 0 and at most 1 (although 1 is probably too large). The number of events is calculated as the length of x multiplied by event_perc, then rounded.

sample_n: The candidate merge sizes. This is an integer vector with all values greater than or equal to 2; the size of each event is drawn from it at random.

vec_merge <- function(x, event_perc = 0.2, sample_n = c(2, 3)){
  # Check if event_perc makes sense
  if (event_perc > 1 || event_perc <= 0){
    stop("event_perc should be greater than 0 and at most 1.")
  }
  # Check if sample_n makes sense
  if (any(sample_n < 2)){
    stop("All values in sample_n should be at least 2.")
  }
  # Determine the event numbers
  n <- round(length(x) * event_perc)
  # Determine the sample number of each event
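  # (indexing with sample.int avoids sample() treating a single number k as 1:k)
  sample_vec <- sample_n[sample.int(length(sample_n), size = n, replace = TRUE)]
  names(sample_vec) <- paste0("S", seq_len(n))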
  # Check if the sum of sample_vec is larger than the length of x
  # If yes, stop the function and print a message 
  if (length(x) < sum(sample_vec)){
    stop("Too many samples. Decrease event_perc or sample_n")
  }
  # Determine the number that will not be merged
  n2 <- length(x) - sum(sample_vec) 
  # Create a vector of n2 ones, one per position that is not merged
  non_merge_vec <- rep(1, n2)
  names(non_merge_vec) <- paste0("N", seq_len(n2))
  # Combine sample_vec and non_merge_vec, then shuffle the combined vector
  # (indexing with sample.int keeps the names even for a length-one vector)
  combine_vec <- c(sample_vec, non_merge_vec)
  combine_vec2 <- combine_vec[sample.int(length(combine_vec))]
  # Expand the vector
  expand_list <- list(lengths = combine_vec2, values = names(combine_vec2))
  expand_vec <- inverse.rle(expand_list)
  # Create a data frame with x and expand_vec
  dat <- data.frame(number = x, 
                    group = factor(expand_vec, levels = unique(expand_vec)))
  dat$index <- 1
  # Name the columns so the result is labelled number/index rather than V1/V2
  dat2 <- aggregate(cbind(number = dat$number, index = dat$index), 
                    by = list(group = dat$group),
                    FUN = sum)
  # Convert dat2 to a matrix, remove the group column
  dat2$group <- NULL
  mat <- as.matrix(dat2)
  return(mat)
}

Here is a test for the function. I applied the function to the sequence from 1 to 10. As you can see, in this example, 4 and 5 are merged, and 8 and 9 are also merged.

set.seed(123)
vec_merge(1:10)
#      number index
# [1,]      1     1
# [2,]      2     1
# [3,]      3     1
# [4,]      9     2
# [5,]      6     1
# [6,]      7     1
# [7,]     17     2
# [8,]     10     1
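
For the rules clarified in the comments (2 to 4 positions per merge, a number of events equal to 20% of the vector length, and merged positions that need not be adjacent), a variant could look roughly like the following. This is only a sketch under those assumptions: `merge_random`, `prop`, and `sizes` are hypothetical names introduced here, and the rows come back ordered by group label rather than by original position.

```r
# Hypothetical helper: randomly merge (sum) positions of x
merge_random <- function(x, prop = 0.2, sizes = 2:4) {
  n_events <- round(length(x) * prop)
  # Draw a size for each event; sample.int avoids sample()'s length-one quirk
  k <- sizes[sample.int(length(sizes), n_events, replace = TRUE)]
  if (sum(k) > length(x)) stop("Merge sizes exceed the length of x")
  # Labels 1..n_events mark merged groups; the rest are singleton groups
  labels <- c(rep(seq_along(k), k),
              n_events + seq_len(length(x) - sum(k)))
  # Shuffle the labels so merged positions need not be adjacent
  grp <- labels[sample.int(length(labels))]
  # A column of ones summed per group counts the merged positions
  rowsum(cbind(total = x, count = 1), group = grp)
}

set.seed(42)
res <- merge_random(c(1, 1, 30, 16, 2, 2, 2, 1, 1, 9))
res
```

Whatever the seed, the `total` column sums to the sum of the input and the `count` column sums to its length, which gives a quick sanity check on the result.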
www