22

Is there a reason why there are two different commands to generate a new variable?

Is there a simple way to remember when to use gen and when to use egen?

max

2 Answers

21

They both create a new variable, but work with different sets of functions. You will typically use gen when you have simple transformations of other variables in your dataset like

gen newvar = oldvar1^2 * oldvar2

In my workflow, egen usually appears when I need functions that work across all observations, as in

egen max_var = max(var)

or more complex instructions

egen newvar = rowmax(oldvar1 oldvar2)

to calculate the maximum for each observation between oldvar1 and oldvar2. I don't think there is a clear logic for separating the two commands.
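As a minimal sketch of the contrast, assuming the auto example dataset that ships with Stata (the new variable names are only illustrative), note that max() means different things in the two commands: inside gen it compares its arguments within one observation, while inside egen it scans a whole variable.

```stata
* Load the example dataset shipped with Stata
sysuse auto, clear

* generate: an expression evaluated observation by observation
generate price_sq = price^2

* egen max(): one statistic computed over all observations
egen max_price = max(price)

* egen rowmax(): largest value across variables within each observation
egen row_max = rowmax(mpg trunk)

* generate's own max() is also row-wise, so this matches row_max
generate row_max2 = max(mpg, trunk)
```

Here max_price holds the same value in every row, whereas row_max and row_max2 vary from car to car.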

griverorz
  • 4
    There's a pretty clear logic, actually. If the task can be done with the existing mathematical functions, you use `generate`. If this is something more complicated, e.g. needs to be done on groups of observations (which are not very easily addressed in Stata), you would need to look for an appropriate `egen` function. – StasK Oct 21 '12 at 00:38
  • 3
    Agree. But I still don't see the logic of having two separate commands. – griverorz Oct 21 '12 at 01:01
  • 8
    I think Stata's logic is very clear: `egen` is used whenever `gen` isn't :) – max Oct 21 '12 at 01:51
  • 4
    @griverorz, there are differences in implementation. `generate` is a fast internal command. `egen` is being parsed by Stata, and you can write extensions to it using Stata ado-code. You cannot do that with `generate`. This is a rather painful legacy of the 80s as compared to R where you can define a function inline and forget it after it was used. – StasK Oct 21 '12 at 02:57
  • StasK is correct. Technically, `egen` is an "extension" to the `generate` command because it reaches beyond simple computations (`var1 + var2`, `log(var1)`, etc.) to add descriptive stats, standardizations and more. Some of the stuff that can be done with `plyr` and `apply` in R is therefore done with `statsby` and `egen` in Stata. I use it to gently hack confidence intervals and scatterplots. – Fr. Oct 21 '12 at 19:13
  • So is it about backwards compatibility? egen functions not breaking or being ambiguous with gen functions? – Simon_Weaver Aug 18 '21 at 20:31
3

gen

generate may be abbreviated to gen or even g and can be used with the following arithmetic operators:

  • + addition
  • - subtraction
  • * multiplication
  • / division
  • ^ power

A large number of functions is available. Here are some examples:

  • abs(x) absolute value of x
  • exp(x) exponential function e^x (the natural antilog of x)
  • int(x) or trunc(x) truncation to integer value
  • ln(x), log(x) natural logarithm of x
  • round(x) x rounded to the nearest integer
  • round(x,y) x rounded in units of y (e.g., round(x,.1) rounds to one decimal place)
  • sqrt(x) square root of x
  • runiform() returns uniformly distributed numbers between 0 and nearly 1
  • rnormal() returns numbers that follow a standard normal distribution
  • rnormal(x,y) returns numbers that follow a normal distribution with a mean of x and a s.d. of y
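A few of these functions can be tried on simulated data. The following is only a sketch: the number of observations, the seed, and the variable names are arbitrary choices made for illustration.

```stata
* Start from an empty dataset with five observations
clear
set obs 5
set seed 12345

* Draws from a normal distribution with mean 10 and s.d. 2
generate x = rnormal(10, 2)

* Draws from the uniform distribution
generate u = runiform()

* Simple transformations, each evaluated observation by observation
generate abs_dev = abs(x - 10)
generate x_int   = int(x)
generate x_round = round(x, .1)
generate ln_x    = ln(x)

list
```

Because ln() is undefined for non-positive arguments, ln_x would be missing for any draw of x at or below zero.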

egen

The egen command implements a number of more complex operations, as in the following examples:

  • egen nkids = anycount(pers1 pers2 pers3 pers4 pers5), value(1)
  • egen v323r = rank(v323)
  • egen myindex = rowmean(var15 var17 var18 var20 var23)
  • egen nmiss = rowmiss(x1-x10 var15-var23)
  • egen total = rowtotal(x1-x10 var15-var23)
  • egen incomst = std(income)
  • bysort v3: egen mincome = mean(income)
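The variables in the list above come from a dataset that is not shown, so here is a comparable sketch on the auto example dataset that ships with Stata. The choice of price, rep78 and foreign, and the new variable names, are assumptions made for illustration.

```stata
* Load the example dataset shipped with Stata
sysuse auto, clear

* Group mean: every car receives the mean price of its own group
bysort foreign: egen mean_price = mean(price)

* Standardized price (mean 0, s.d. 1 over the whole sample)
egen std_price = std(price)

* Number of missing values per observation across the listed variables
egen n_missing = rowmiss(price rep78)

* Rank of each car by price
egen price_rank = rank(price)

summarize mean_price std_price n_missing price_rank
```

Each egen call needs a new variable name, since egen, like generate, refuses to overwrite an existing variable.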

Detailed usage explanations can be found at this link.

GorkemHalulu