I have the following dataframe:

structure(list(Store = c("vpm", "vpm", 
"vpm"), Date = structure(c(18042, 18042, 18042), class = "Date"), 
    UniqueImageId = c("vp3_523", "vp3_668", "vp3_523"), EntryTime = structure(c(1558835514, 
    1558834942, 1558835523), class = c("POSIXct", "POSIXt")), 
    ExitTime = structure(c(1558838793, 1558838793, 1558839824
    ), class = c("POSIXct", "POSIXt")), Duration = c(3279, 3851, 
    4301), Age = c(35L, 35L, 35L), EntryPoint = c("Entry2Side", 
    "Entry2Side", "Entry2Side"), ExitPoint = c("Exit2Side", "Exit2Side", 
    "Exit2Side"), AgeNew = c("15_20", "25_32", "15_20"), GenderNew = c("Female", 
    "Male", "Female")), row.names = 4:6, class = c("data.table", 
"data.frame"))

I am trying to replace each age-range label in the column AgeNew with a random number drawn from that range, using the sample function inside nested ifelse conditions.

I tried the following:

d$AgeNew <- ifelse(d$AgeNew == "0_2",   sample(0:2,  1,replace = TRUE), 
            ifelse(d$AgeNew == "15_20", sample(15:20,1,replace = TRUE), 
            ifelse(d$AgeNew == "25_32", sample(25:36,1,replace = TRUE), 
            ifelse(d$AgeNew == "38_43", sample(36:43,1,replace = TRUE), 
            ifelse(d$AgeNew == "4_6",   sample(4:6,  1,replace = TRUE), 
            ifelse(d$AgeNew == "48_53", sample(48:53,1,replace = TRUE), 
            ifelse(d$AgeNew == "60_Inf",sample(60:65,1,replace = TRUE), 
                                        sample(8:13, 1,replace = TRUE))))))))

But the same value is repeated for every row in a group. For example, every row in the age group 0_2 is populated with 2. I tried using set.seed

set.seed(123)

and then running the ifelse again, but it still repeats the same value.

Sotos
Apricot

3 Answers

3

This has been discussed before (I cannot find the source at the moment). ifelse evaluates each of its yes and no arguments only once, not once per element, so a length-one result from sample is recycled across every matching position. Consider this example:

x <- c(1, 2, 1, 2, 1, 2)

ifelse(x == 1, sample(1:10, 1), sample(20:30, 1))
#[1]  1 26  1 26  1 26
ifelse(x == 1, sample(1:10, 1), sample(20:30, 1))
#[1] 10 28 10 28 10 28
ifelse(x == 1, sample(1:10, 1), sample(20:30, 1))
#[1]  9 24  9 24  9 24

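As we can see, each branch produces a single number that is recycled across all matching positions. To avoid that, set the size argument of sample to the length of the test vector so that each element gets its own draw. Note that sample draws without replacement by default, so add replace = TRUE if x can be longer than the range being sampled.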

ifelse(x == 1, sample(1:10, length(x)), sample(20:30, length(x)))
#[1]  7 23  1 26 10 24
ifelse(x == 1, sample(1:10, length(x)), sample(20:30, length(x)))
#[1]  3 23  5 26  6 22 
ifelse(x == 1, sample(1:10, length(x)), sample(20:30, length(x)))
#[1]  2 30  9 27  1 29
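
The recycling can also be checked directly. This is a minimal sketch that assumes the same x as above; the names one_draw and many_draws are hypothetical, and replace = TRUE is added so the call stays valid when x is longer than the range:

```r
set.seed(42)
x <- c(1, 2, 1, 2, 1, 2)

# yes/no are evaluated once; a length-1 result is recycled to every position
one_draw <- ifelse(x == 1, sample(1:10, 1), sample(20:30, 1))
length(unique(one_draw[x == 1]))  # 1: every matching element shares one draw

# one draw per element; replace = TRUE keeps this valid for long vectors
many_draws <- ifelse(x == 1,
                     sample(1:10, length(x), replace = TRUE),
                     sample(20:30, length(x), replace = TRUE))
many_draws
```

Each element of many_draws is picked from its own position in the yes or no vector, so rows in the same group are no longer forced to share a value.
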
Ronak Shah
1

An easier option is to replace the _ with :, evaluate the resulting expression to get the range, and sample one element from that range

library(data.table)
d[, AgeNew := sapply(sub("_", ":", sub('Inf', '65', AgeNew)),
           function(x) sample(eval(parse(text = x)), 1))]
d[is.na(AgeNew), AgeNew := sample(8:13, 1)]
d
#  Store       Date UniqueImageId           EntryTime            ExitTime Duration Age EntryPoint ExitPoint AgeNew GenderNew
#1:   vpm 2019-05-26       vp3_523 2019-05-25 21:51:54 2019-05-25 22:46:33     3279  35 Entry2Side Exit2Side     15    Female
#2:   vpm 2019-05-26       vp3_668 2019-05-25 21:42:22 2019-05-25 22:46:33     3851  35 Entry2Side Exit2Side     30      Male
#3:   vpm 2019-05-26       vp3_523 2019-05-25 21:52:03 2019-05-25 23:03:44     4301  35 Entry2Side Exit2Side     17    Female

Or another option with tidyverse

library(tidyverse)
d %>% 
   mutate(AgeNew = str_replace(AgeNew, "Inf", "65")) %>%
   separate(AgeNew, into = c('start', 'end'), convert = TRUE) %>% 
   mutate(AgeNew = map2_int(start, end, ~ sample(.x:.y, 1)))

Or another option is to split by _, and then sample

d[, AgeNew := unlist(lapply(strsplit(sub('Inf', '65', AgeNew),  "_"), function(x)
            sample(as.numeric(x[1]):as.numeric(x[2]), 1)))]
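
To confirm that splitting and sampling draws independently for each row, the same idea can be tried on a plain character vector. This is only a sketch: ages is a hypothetical stand-in for d$AgeNew, bounds and drawn are illustrative names, and the upper bound of 65 for Inf follows the assumption used above.

```r
set.seed(1)
# hypothetical stand-in for d$AgeNew, with many rows per group
ages <- rep(c("15_20", "25_32", "60_Inf"), each = 50)

# split each label into its lower and upper bound
bounds <- strsplit(sub("Inf", "65", ages), "_")

# draw one integer per element from lo:hi
drawn <- vapply(bounds, function(b) {
  lo <- as.integer(b[1])
  hi <- as.integer(b[2])
  lo + sample.int(hi - lo + 1L, 1L) - 1L
}, integer(1))

tapply(drawn, ages, range)  # per-group min and max stay inside each range
```

Because the anonymous function runs once per element, each row gets its own call to the random number generator, which is what the nested ifelse was missing.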

Note that no nested ifelse is needed here. The ranges can be parsed directly from the labels, which is much easier than listing every group by hand

NOTE2: The OP showed a data.table as example and here we are showing data.table methods

NOTE3: It is very inefficient to use nested ifelse

NOTE4: the strsplit based approach was first posted here


Regarding why ifelse behaves this way, it is already explained in the documentation (?ifelse):

If yes or no are too short, their elements are recycled. yes will be evaluated if and only if any element of test is true, and analogously for no.
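
Both halves of that statement can be seen in a short sketch (res and recycled are hypothetical names used only for illustration):

```r
# every element of test is TRUE, so `no` is never evaluated and stop() never fires
res <- ifelse(c(TRUE, TRUE), 1, stop("`no` was evaluated"))
res

# a length-1 `yes` is recycled to the length of test
recycled <- ifelse(c(TRUE, TRUE, TRUE), sample(1:10, 1), 0)
length(unique(recycled))  # 1
```

The second call is the situation in the question: sample(…, 1) runs once, and its single value fills every matching position.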

akrun
  • I am getting this error: `Error in `:=`(AgeNew, sapply(sub("_", ":", AgeNew), function(x) sample(eval(parse(text = x)), : could not find function ":="` – Apricot Jun 24 '19 at 07:29
  • @Apricot Your dataset seems to be `data.table`. So, I assume that you loaded `data.table` – akrun Jun 24 '19 at 07:30
  • Whether I convert the object to a plain data frame or keep it as a data.table, I am still getting the error. – Apricot Jun 24 '19 at 07:33
  • @Apricot I don't get the error with the example you showed – akrun Jun 24 '19 at 07:33
  • @Apricot Is this error based on the example you showed or different data as I can't reproduce it – akrun Jun 24 '19 at 07:36
0

You will need to handle the Inf. From your example, I assume that an Inf upper bound should become the lower bound plus 5 (so 60_Inf becomes 60:65). Based on that assumption, we can do:

sapply(strsplit(d$AgeNew, '_'), function(i){
                  sample(i[1]:replace(i[2], i[2] == 'Inf', as.numeric(i[1]) + 5), 1)
                  })

#[1] 60 32 19

NOTE: I changed the first entry of AgeNew to 60_Inf in order to test

Sotos