This is my first attempt at simulating data. We'd like to simulate a dataset and have chosen the simstudy package, using the following code:

def <- defData(varname='median_household_income',formula=mean(
               df$median_household_income))
def <- defData(def, varname='share_unemployed_seasonal',formula=mean(
               df$share_unemployed_seasonal))
def <- defData(def, varname='share_population_in_metro_areas',
               formula=mean(df$share_population_in_metro_areas))
def <- defData(def, varname='share_population_with_high_school_degree',
               formula=mean(df$share_population_with_high_school_degree))
def <- defData(def, varname='share_non_citizen',
               formula=mean(df$share_non_citizen))
def <- defData(def, varname='share_white_poverty',
               formula=mean(df$share_white_poverty))
def <- defData(def, varname='gini_index',formula=mean(df$gini_index))
def <- defData(def, varname='share_non_white',formula=mean(df$share_non_white))
def <- defData(def, varname='share_voters_voted_trump',
               formula=mean(df$share_voters_voted_trump))
#outcome
def <- defData(def, varname='avg_hatecrimes_per_100k_fbi',formula=
               ".0001*median_household_income + 44*share_unemployed_seasonal + 
               -2.8*share_population_in_metro_areas +
               24*share_population_with_high_school_degree + 22*share_non_citizen + 
               3.2*share_white_poverty + 55*gini_index + -4*share_non_white + 
               -2.6*share_voters_voted_trump")

#generate simulated data
df_sim <- genData(10000,def)

The output looks like this:

 head(df_sim)
 id median_household_income share_unemployed_seasonal share_population_in_metro_areas
1:  1                55223.61                0.04956863                       0.7501961
2:  2                55223.61                0.04956863                       0.7501961
3:  3                55223.61                0.04956863                       0.7501961
4:  4                55223.61                0.04956863                       0.7501961
5:  5                55223.61                0.04956863                       0.7501961
6:  6                55223.61                0.04956863                       0.7501961

Why are all the generated values identical? My understanding is that the variables are generated from a normal distribution by default. Any help with this is appreciated!

massisenergy
Alissa
  • It would be helpful if you specified where the functions defData and genData come from. – ira Dec 14 '18 at 16:04
  • In general, it is good to follow: https://stackoverflow.com/help/how-to-ask and https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – ira Dec 14 '18 at 16:13

1 Answer

You appear to be using the simstudy package. According to the documentation for the defData function, it has a variance parameter that defaults to 0. With zero variance, every draw from the normal distribution equals its mean, which is why all your rows are identical. To get non-identical observations, set variance to a number larger than 0.
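To see why zero variance collapses every row to the mean, here is a minimal base R sketch. It assumes that a "normal" variable with variance v behaves like a draw from rnorm with sd = sqrt(v); that is an illustration of the idea, not a claim about how simstudy is implemented. The names zero_var and some_var, and the variance of 1e6, are made up for this example.

```r
# Assumption: a normal variable with variance v is drawn like rnorm(n, mean, sqrt(v)).
set.seed(1)
mu <- 55223.61  # mean of median_household_income from the question's output

# Variance 0: the distribution is degenerate, so every draw is exactly mu
zero_var <- rnorm(5, mean = mu, sd = sqrt(0))

# Variance > 0 (1e6 is an arbitrary illustrative value): draws scatter around mu
some_var <- rnorm(5, mean = mu, sd = sqrt(1e6))

print(zero_var)
print(some_var)
```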

The default arguments of the defData function:

defData(dtDefs = NULL, varname, formula, variance = 0,
  dist = "normal", link = "identity", id = "id")

So you might want to run a command like this:

def <- defData(varname='median_household_income',
               formula=mean(df$median_household_income),
               variance = 1)
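A variance of 1 is very small next to an income of about 55,000, so one option is to take the variance from the observed data as well. The sketch below assumes a toy df with invented values standing in for the real one, uses only two of the predictors, and gives the outcome a residual variance of 1, which is an arbitrary choice for illustration. Without a positive variance on the outcome, it would be an exact linear function of the predictors.

```r
library(simstudy)

set.seed(42)

# Hypothetical stand-in for the original df (values invented for illustration)
df <- data.frame(
  median_household_income = c(42000, 55000, 61000, 48000, 70000),
  gini_index              = c(0.42, 0.45, 0.47, 0.44, 0.46)
)

# Predictors: mean and variance both taken from the observed data
def <- defData(varname = "median_household_income",
               formula = mean(df$median_household_income),
               variance = var(df$median_household_income))
def <- defData(def, varname = "gini_index",
               formula = mean(df$gini_index),
               variance = var(df$gini_index))

# Outcome: coefficients from the question; variance = 1 is an assumed noise level
def <- defData(def, varname = "avg_hatecrimes_per_100k_fbi",
               formula = ".0001*median_household_income + 55*gini_index",
               variance = 1)

df_sim <- genData(1000, def)
head(df_sim)
```

Note that each predictor is still drawn independently, so correlations between the original variables are not reproduced by this approach.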
ira