
This should be fairly easy, but I can't find a quick way to do it. All I want to do is replace certain levels of a factor with random numbers (I'm building a data frame from scratch and want certain levels of the factors to have different ranges of values).

data <- data.frame(
    animal  = sample(c("lion", "tiger", "bear"), 50, replace = TRUE),
    region  = sample(c("north", "south", "east", "west"), 50, replace = TRUE),
    reports = sample(50:100, 50, replace = TRUE))

Something like this doesn't work, because you have to specify the number of elements to be generated:

data$animal <- sub("lion",rnorm(15,10,2),data$animal)

Which gives the warning:

    Warning: In sub("lion", rnorm(15, 10, 2), data$animal) :
      argument 'replacement' has length > 1 and only the first element will be used

Does anybody have an easy way to do this, or is it not possible to use the "sub" expressions with numbers?

Marc Tulla
  • There's a question of data type here - in your example, data$animal is a character vector, not a factor. When you replace in random numbers, you'll have to re-type those numbers as a character. – Drew Steen Dec 29 '13 at 21:11
  • @DrewSteen It is most likely a factor. Whether it is, depends on the value of `default.stringsAsFactors()`. – Matthew Lundberg Dec 29 '13 at 21:14
  • @MatthewLundberg - True, if he's importing the data from a file. But as he's 'building the dataframe from scratch', it might not be. – Drew Steen Dec 30 '13 at 01:34
  • @DrewSteen - Doesn't matter. See http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters or http://stackoverflow.com/questions/11538532/change-stringsasfactors-settings-for-data-frame or `?data.frame`. – Matthew Lundberg Dec 30 '13 at 01:43
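
The type of `data$animal` can be checked directly rather than guessed. `data.frame()` converted strings to factors by default before R 4.0.0 and has left them as character vectors since, so the answer depends on the R version unless `stringsAsFactors` is set explicitly. A minimal sketch (the names `animals`, `as_factor` and `as_char` are illustrative, not from the question):

```r
# Build the same column both ways and inspect its type explicitly.
animals <- c("lion", "tiger", "bear")

as_factor <- data.frame(animal = animals, stringsAsFactors = TRUE)
as_char   <- data.frame(animal = animals, stringsAsFactors = FALSE)

is.factor(as_factor$animal)     # TRUE: strings were converted to a factor
is.character(as_char$animal)    # TRUE: strings were kept as character
```

Setting `stringsAsFactors` explicitly in the `data.frame()` call removes the ambiguity discussed in the comments above.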

1 Answer

I don't understand why you'd want this, but here we go.

set.seed(42)
data <- data.frame(
  animal  = sample(c("lion", "tiger", "bear"), 50, replace = TRUE),
  region  = sample(c("north", "south", "east", "west"), 50, replace = TRUE),
  reports = sample(50:100, 50, replace = TRUE))

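# A factor cannot hold values outside its levels, so convert to character first
data$animal <- as.character(data$animal)
to.change <- data$animal == "lion"
# Draw one random number per "lion" row; the numbers are stored as strings
data$animal[to.change] <- rnorm(sum(to.change), 10, 2)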

#              animal region reports
# 1              bear  south      81
# 2              bear  south      61
# 3  11.1619929953634  south      61
# 4              bear   west      69
# 5             tiger  north      98
# 6             tiger   east      99
# 7              bear   east      87
# 8  11.5363574756692  north      87
# 9             tiger  south      77
# 10             bear   east      50
# 11            tiger   east      81
# 12             bear   west      92
# 13             bear   west      88
# 14 10.9275351770803   east      73
# 15            tiger   west      77
# 16             bear  north      77
# 17             bear  south      50
# 18 8.22844740518064   west      68
# 19            tiger   east      81
# 20            tiger  north      92
# 21             bear  north      68
# 22 7.80043820270429  north      70
# 23             bear  north      79
# 24             bear  south      80
# 25 13.0254140196099  north      86
# 26            tiger   east      70
# 27            tiger  north      96
# 28             bear  south      99
# 29            tiger   east      61
# 30             bear  north      86
# 31             bear   east      96
# 32             bear  north      80
# 33            tiger  south      82
# 34             bear   east      97
# 35 10.5158428750641   west      93
# 36             bear   east      79
# 37 10.1768804583192  north      91
# 38 9.75820692492182  north      55
# 39             bear  north      88
# 40            tiger  south      81
# 41            tiger   east      57
# 42            tiger  north      54
# 43 7.61134220967894  north      73
# 44             bear   west      89
# 45            tiger   west      87
# 46             bear   east      91
# 47             bear  south      58
# 48            tiger   east      98
# 49             bear   east      64
# 50            tiger   east      57
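
In case it's unclear why the `as.character()` step matters: a factor only accepts values that are already among its levels, so assigning a number into it directly produces `NA` and a warning. A minimal sketch on a toy vector (the names `f`, `g` and `h` are illustrative, not from the question):

```r
f <- factor(c("lion", "tiger", "bear"))

# Direct assignment into the factor: 10.5 is not an existing level,
# so R stores NA and warns "invalid factor level, NA generated".
g <- f
g[g == "lion"] <- 10.5
is.na(g[1])    # TRUE

# Converting to character first lets the number be stored (as a string).
h <- as.character(f)
h[h == "lion"] <- 10.5
h              # "10.5" "tiger" "bear"
```

Note that the replaced values are strings, not numbers, which is one reason a separate numeric column is usually the better design.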

Edit:

From your comment it seems you actually want something like this:

offense <- data.frame(animal=c("lion","tiger","bear"),
                      mean=c(35,25,10),
                      sd=c(3,2,1))

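library(plyr)
# Attach each animal's mean and sd, then draw one value per row within each group
data <- ddply(merge(data, offense),
              .(animal),
              transform,
              attacks = rnorm(length(mean), mean = mean, sd = sd),
              mean = NULL,   # drop the helper columns again
              sd = NULL)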

#    animal region reports   attacks
# 1    bear  south      81 10.580996
# 2    bear  south      61 10.768179
# 3    bear  north      77 10.463768
# 4    bear   west      69  9.114224
# 5    bear   east      96  8.900219
# 6    bear  north      80 11.512707
# 7    bear   east      87 10.257921
# 8    bear  north      68 10.088440
# 9    bear   west      88  9.879103
# 10   bear   east      50  8.805671
# 11   bear  south      80 10.611997
# 12   bear   west      92  9.782860
# 13   bear  south      50  9.817243
# 14   bear   west      89 10.933346
# 15   bear  south      99 10.821773
# 16   bear   east      91 11.392116
# 17   bear   east      97  9.523826
# 18   bear  north      88 10.650349
# 19   bear  north      79 11.391110
# 20   bear   east      79  8.889211
# 21   bear   east      64  9.139207
# 22   bear  north      86  8.868261
# 23   bear  south      58  8.540786
# 24   lion   west      68 35.239948
# 25   lion  south      61 36.959613
# 26   lion  north      70 38.602896
# 27   lion  north      73 38.134253
# 28   lion  north      91 31.990374
# 29   lion  north      86 40.545446
# 30   lion   east      73 32.999680
# 31   lion  north      87 35.316541
# 32   lion   west      93 33.733232
# 33   lion  north      55 34.632949
# 34  tiger   west      77 25.376386
# 35  tiger   east      61 25.238322
# 36  tiger   east      99 24.949815
# 37  tiger   east      81 25.216145
# 38  tiger  north      92 24.029130
# 39  tiger  north      96 23.991566
# 40  tiger  south      81 21.677802
# 41  tiger   east      81 24.235333
# 42  tiger  north      54 23.974699
# 43  tiger  south      77 30.403782
# 44  tiger  north      98 22.275768
# 45  tiger   east      57 25.274512
# 46  tiger  south      82 22.012750
# 47  tiger   east      70 22.059129
# 48  tiger   east      98 25.249405
# 49  tiger   west      87 23.006722
# 50  tiger   east      57 24.996355
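
If you'd rather not depend on plyr, the same idea can probably be done in base R, since `rnorm()` is vectorised over `mean` and `sd`. This is only a sketch under the assumptions of the example above (same `data` and `offense` tables); `idx` is a helper name introduced here for illustration:

```r
set.seed(42)
data <- data.frame(
  animal  = sample(c("lion", "tiger", "bear"), 50, replace = TRUE),
  region  = sample(c("north", "south", "east", "west"), 50, replace = TRUE),
  reports = sample(50:100, 50, replace = TRUE))

offense <- data.frame(animal = c("lion", "tiger", "bear"),
                      mean   = c(35, 25, 10),
                      sd     = c(3, 2, 1))

# For every row of data, find the matching row of the lookup table
idx <- match(data$animal, offense$animal)

# One draw per row, each with the mean and sd of that row's animal
data$attacks <- rnorm(nrow(data),
                      mean = offense$mean[idx],
                      sd   = offense$sd[idx])

# Quick check: group means should be near 35, 25 and 10
aggregate(attacks ~ animal, data = data, FUN = mean)
```

Unlike the `merge()` approach, this keeps the rows in their original order, because `merge()` sorts the result by the key column by default.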
Roland
  • Roland, in reference to your comment, do you have a better idea for how I'd do the following: within a data frame, generate a second variable that consists of random numbers for the factor variable, with different levels of the factor having higher or lower numbers? For example, in this data frame, having a new variable called "attacks" that averages around 10 for "bear", around 25 for "tiger", and around 35 for "lion". Let me know if the question doesn't make sense... – Marc Tulla Dec 30 '13 at 01:28