Subbing random numbers for text

Question

This should be fairly easy but I can't find a quick way to do it. All I want to do is replace certain levels of a factor with random numbers (I'm building a dataframe from scratch and want certain levels of the factors to have different ranges of values).

data <- data.frame(
    animal = sample(c("lion","tiger","bear"),50,replace=TRUE),
    region = sample(c("north","south","east","west"),50,replace=TRUE),
    reports = sample(50:100,50,replace=TRUE))

Something like this doesn't work, because you have to specify the number of elements to be generated up front:

data$animal <- sub("lion",rnorm(15,10,2),data$animal)

Which gives the warning:

    Warning: In sub("lion", rnorm(15, 10, 2), data$animal) :
      argument 'replacement' has length > 1 and only the first element will be used

Does anybody have an easy way to do this, or is it not possible to use the "sub" expressions with numbers?
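
For context, the warning above arises because `sub` takes a single replacement string rather than one value per match. A minimal sketch, using a made-up toy vector `x` (not the data frame from the question), suggests what happens to the extra values: only the first is used, it is coerced to character, and every match receives it.

```r
# Hypothetical toy vector, for illustration only
x <- c("lion", "tiger", "lion")

# sub() uses one replacement string; further elements are dropped
# (the warning is suppressed here only to keep the output quiet)
res <- suppressWarnings(sub("lion", c(1.5, 2.5), x))

# Both "lion" entries now hold the same text; 2.5 was never used
res
```

So `sub` can insert numbers as text, but it cannot hand a different random number to each matching element.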

There's a question of data type here - in your example, data$animal is a character vector, not a factor. When you replace in random numbers, you'll have to re-type those numbers as a character. — Drew Steen, Dec 29 '13 at 21:11
@DrewSteen It is most likely a factor. Whether it is, depends on the value of `default.stringsAsFactors()`. — Matthew Lundberg, Dec 29 '13 at 21:14
@MatthewLundberg - True, if he's importing the data from a file. But as he's 'building the dataframe from scratch', it might not be. — Drew Steen, Dec 30 '13 at 01:34
@DrewSteen - Doesn't matter. See http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters or http://stackoverflow.com/questions/11538532/change-stringsasfactors-settings-for-data-frame or `?data.frame`. — Matthew Lundberg, Dec 30 '13 at 01:43
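
The factor-versus-character point in these comments can be checked directly. The sketch below sets `stringsAsFactors` explicitly so that it does not depend on the session default (in R 4.0.0 and later that default is `FALSE`); the object names `as_factor`, `as_char` and `f` are made up for illustration.

```r
# Explicit stringsAsFactors, so the result is the same in any R version
as_factor <- data.frame(animal = c("lion", "bear"), stringsAsFactors = TRUE)
as_char   <- data.frame(animal = c("lion", "bear"), stringsAsFactors = FALSE)

class(as_factor$animal)
class(as_char$animal)

# Assigning a value that is not an existing level to a factor gives NA
# (R warns about an invalid factor level; suppressed here for brevity)
f <- as_factor$animal
suppressWarnings(f[f == "lion"] <- 10.3)
is.na(f[1])
```

This is why converting the column with `as.character` before inserting numbers is the safer route: a factor would silently turn the new values into `NA`.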

Roland · Accepted Answer · 2013-12-30T17:46:07.333

I don't understand why you'd want this, but here we go.

set.seed(42)
data <- data.frame(
  animal = sample(c("lion","tiger","bear"),50,replace=TRUE),
  region = sample(c("north","south","east","west"),50,replace=TRUE),
  reports = sample(50:100,50,replace=TRUE))

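# Convert to character so that numbers (as text) can sit next to the animal names
data$animal <- as.character(data$animal)
# Logical index of the rows to replace
to.change <- data$animal=="lion"
# Draw exactly one random value per matching row; they are coerced to character
data$animal[to.change] <- rnorm(sum(to.change),10,2)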

#              animal region reports
# 1              bear  south      81
# 2              bear  south      61
# 3  11.1619929953634  south      61
# 4              bear   west      69
# 5             tiger  north      98
# 6             tiger   east      99
# 7              bear   east      87
# 8  11.5363574756692  north      87
# 9             tiger  south      77
# 10             bear   east      50
# 11            tiger   east      81
# 12             bear   west      92
# 13             bear   west      88
# 14 10.9275351770803   east      73
# 15            tiger   west      77
# 16             bear  north      77
# 17             bear  south      50
# 18 8.22844740518064   west      68
# 19            tiger   east      81
# 20            tiger  north      92
# 21             bear  north      68
# 22 7.80043820270429  north      70
# 23             bear  north      79
# 24             bear  south      80
# 25 13.0254140196099  north      86
# 26            tiger   east      70
# 27            tiger  north      96
# 28             bear  south      99
# 29            tiger   east      61
# 30             bear  north      86
# 31             bear   east      96
# 32             bear  north      80
# 33            tiger  south      82
# 34             bear   east      97
# 35 10.5158428750641   west      93
# 36             bear   east      79
# 37 10.1768804583192  north      91
# 38 9.75820692492182  north      55
# 39             bear  north      88
# 40            tiger  south      81
# 41            tiger   east      57
# 42            tiger  north      54
# 43 7.61134220967894  north      73
# 44             bear   west      89
# 45            tiger   west      87
# 46             bear   east      91
# 47             bear  south      58
# 48            tiger   east      98
# 49             bear   east      64
# 50            tiger   east      57

Edit:

From your comment it seems you actually want something like this:

offense <- data.frame(animal=c("lion","tiger","bear"),
                      mean=c(35,25,10),
                      sd=c(3,2,1))

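library(plyr)
# Assumes `data` as freshly created above, before the "lion" replacement.
# merge() attaches each animal's mean and sd; ddply() then works on one animal at a time.
data <- ddply(merge(data, offense), 
              .(animal), 
              transform, 
              attacks=rnorm(length(mean), mean=mean, sd=sd),  # one draw per row
              mean=NULL,  # drop the helper columns again
              sd=NULL)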

#    animal region reports   attacks
# 1    bear  south      81 10.580996
# 2    bear  south      61 10.768179
# 3    bear  north      77 10.463768
# 4    bear   west      69  9.114224
# 5    bear   east      96  8.900219
# 6    bear  north      80 11.512707
# 7    bear   east      87 10.257921
# 8    bear  north      68 10.088440
# 9    bear   west      88  9.879103
# 10   bear   east      50  8.805671
# 11   bear  south      80 10.611997
# 12   bear   west      92  9.782860
# 13   bear  south      50  9.817243
# 14   bear   west      89 10.933346
# 15   bear  south      99 10.821773
# 16   bear   east      91 11.392116
# 17   bear   east      97  9.523826
# 18   bear  north      88 10.650349
# 19   bear  north      79 11.391110
# 20   bear   east      79  8.889211
# 21   bear   east      64  9.139207
# 22   bear  north      86  8.868261
# 23   bear  south      58  8.540786
# 24   lion   west      68 35.239948
# 25   lion  south      61 36.959613
# 26   lion  north      70 38.602896
# 27   lion  north      73 38.134253
# 28   lion  north      91 31.990374
# 29   lion  north      86 40.545446
# 30   lion   east      73 32.999680
# 31   lion  north      87 35.316541
# 32   lion   west      93 33.733232
# 33   lion  north      55 34.632949
# 34  tiger   west      77 25.376386
# 35  tiger   east      61 25.238322
# 36  tiger   east      99 24.949815
# 37  tiger   east      81 25.216145
# 38  tiger  north      92 24.029130
# 39  tiger  north      96 23.991566
# 40  tiger  south      81 21.677802
# 41  tiger   east      81 24.235333
# 42  tiger  north      54 23.974699
# 43  tiger  south      77 30.403782
# 44  tiger  north      98 22.275768
# 45  tiger   east      57 25.274512
# 46  tiger  south      82 22.012750
# 47  tiger   east      70 22.059129
# 48  tiger   east      98 25.249405
# 49  tiger   west      87 23.006722
# 50  tiger   east      57 24.996355
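
If plyr is not available, the same idea can be sketched in base R. This is an alternative to the `ddply` call above, not the answer's own method: `match` looks up each row's animal in the `offense` table, and `rnorm`, which is vectorised over `mean` and `sd`, then draws one value per row. The seed and the trimmed-down data frame are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
data <- data.frame(
  animal  = sample(c("lion", "tiger", "bear"), 50, replace = TRUE),
  reports = sample(50:100, 50, replace = TRUE),
  stringsAsFactors = FALSE)

offense <- data.frame(animal = c("lion", "tiger", "bear"),
                      mean = c(35, 25, 10),
                      sd = c(3, 2, 1),
                      stringsAsFactors = FALSE)

# Row of `offense` that corresponds to each row of `data`
idx <- match(data$animal, offense$animal)

# One draw per row, each from that animal's own distribution
data$attacks <- rnorm(nrow(data), mean = offense$mean[idx], sd = offense$sd[idx])

# Rough check: group means should land near 35, 25 and 10
aggregate(attacks ~ animal, data = data, FUN = mean)
```

Unlike the `merge` route, this keeps the rows of `data` in their original order, since nothing is sorted by `animal`.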

Roland, in reference to your comment, do you have a better idea for how I'd do the following: within a dataframe, generate a second variable that consists of random numbers for the factor variable, with different levels of the factor having higher or lower numbers? For example, in this dataframe, having a new variable called "attacks" that averages around 10 for "bear", around 25 for "tiger", and around 35 for "lion". Let me know if the question doesn't make sense... — Marc Tulla, Dec 30 '13 at 01:28
