(From Stata to R) Data Exploration and Variable Creation: count, list, bysort, egen

Question

It's been exciting and challenging trying to transition from Stata to R, but one area I'm still struggling with in R is data exploration and then subsequent variable creation. Specifically, how to

count the values of a variable (Stata's count command)

count if var2==3
/* counts the number of observations that have a value of 3 on var2 */

list observations meeting a condition (Stata's if qualifier)

list id if var7 < 8
/*lists the ID of observations with a value less than 8 on var7 */

tabulate by a grouping variable (Stata's bysort command)

bysort var3: tab1 var2 var9 if var8==2 | var1 !=11
/* create a two-way frequency table for those observations of var2 and var9 where
   var8 is 2 or var1 isn't 11 */

create a new variable from another (Stata's egen command)

egen var3 = count(var1), by(var2)
/* creates var3 as the total observations in var1, for each category in var2;
   here var2 is a categorical variable, so, this code seeks to count the frequency
   of var1 (say, 'trades' among NFL teams), counted separately by each category of
   var2 (say, 32 different NFL teams). */

I'm more experienced in Stata than in R. My advice is: Don't expect one-to-one matches. R's concepts don't all correspond to Stata's and there's no reason why they should. On a political-psychological front the implication that R is lacking because it lacks a twin to anything in Stata may seem as obnoxious or ridiculous or bizarre as the opposite attitude would. — Nick Cox, Dec 18 '14 at 22:04
Have you checked the [statar](https://github.com/matthieugomez/statar) package? — Arun, Dec 18 '14 at 22:06
As an SPSS refugee, I'd really recommend Quick-R - http://statmethods.net/management/index.html - it was the first site that cut through a lot of my confusion and gave simple examples. — thelatemail, Dec 18 '14 at 22:17
Yes, thanks. I've relied heavily on Quick-R and also read R in Action cover to cover. They've all helped a lot, but it seems that there are some pretty basic things I do with little effort in Stata that I can't figure out how to do in R. I don't expect R to match Stata one-to-one (or vice versa), but I'm pretty sure R can probably do these things, and I've been frustrated over the past few weeks trying to figure them out. I will check out the "statar" package. My hope is to do as much as possible in base R, as I find some packages's coding syntax too different, but I'm definitely open to them — coip, Dec 19 '14 at 01:49
In your examples at the bottom I see nothing I couldn't do with R quite easily. Even with base R, although for some of those tasks there are nice packages that have syntax that feels more natural (such as plyr, which you mentioned). As an experienced R user, Stata syntax doesn't look easier to me. Some concepts are just different between the languages, which is not necessarily a shortcoming of either language (but obviously I prefer R :). — Roland, Dec 19 '14 at 08:10
Had a look at http://www.amazon.com/R-Stata-Users-Statistics-Computing/dp/1441913173 ? I used to aggregate, sum up my data using `Stata` until it has lost its all attractions to `dplyr`. — Khashaa, Dec 19 '14 at 12:37

score 6 · Accepted Answer · edited May 23 '17 at 11:59

I tried to answer your questions at the end. First, an example data frame to play around with:

set.seed(123)
# id has only 10 values; data.frame() recycles them to fill all 100 rows
df <- data.frame(id=paste0(letters[1:10], 1:10), matrix(sample(1:20, 500, replace=TRUE), nrow=100, ncol=5))
colnames(df)[2:6] <- paste0("var", 1:5)

1. Count values of a variable

For the first question, I'm not sure why you wouldn't do this with table(var2), but if you want, there are a couple of ways to do it.

count if var2==3       /* counts the number of observations that 
                          have a value of 3 on var2 */

With the first one I tried to replicate what Stata does when you ask it to count. Here we subset the data frame for var2==3, then count the number of rows.

nrow(df[df$var2==3, ])

You can do this more directly by taking the vector df$var2==3, which is a logical TRUE/FALSE vector with the same length as nrow(df), and summing the values, which will implicitly convert the vector from logical to 0/1:

sum(df$var2==3)
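
One caveat if your variable has missing values: in R a comparison with NA is itself NA, so a plain sum() comes back as NA, whereas Stata's count simply skips observations that don't satisfy the condition. A minimal sketch, assuming a small made-up data frame (toy is hypothetical and not the df from above):

```r
# Hypothetical toy data with one missing value (not the df defined above)
toy <- data.frame(var2 = c(3, 1, 3, NA, 2, 3))

# The comparison is NA for the missing row, so the plain sum is NA
sum(toy$var2 == 3)

# na.rm = TRUE drops the NA comparisons before summing
n_three <- sum(toy$var2 == 3, na.rm = TRUE)
n_three

# table() counts every distinct value at once; pick one by its label
counts <- table(toy$var2)
counts[["3"]]
```

table() ignores NA by default, so both approaches count only the observed values.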

2. List values meeting a condition

The second question also basically comes down to subsetting, and in general I think what you would use if for in Stata comes down to subsetting an R data frame with the same logical conditions.

list id if var7 < 8    /* lists the ID of observations with a 
                          value less than 8 on var7 */

So here we subset the data frame by restricting rows to those that meet the condition var5 < 8 (var5 stands in for var7, which the example data doesn't have) and selecting the variable, id, that we want.

df$id[df$var5 < 8]
# or
df[df$var5 < 8, "id"]
# or
subset(df, var5 < 8, select="id")
# or
with(df, id[var5 < 8])

People usually don't recommend subset for anything but interactive use, since its non-standard evaluation can misbehave inside functions. The second way is useful if you want to select variables whose names are contained in another object, e.g.

want <- c("id", "var1")
df[df$var5 < 8, want]
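
Something to watch for coming from Stata: Stata treats missing values as larger than any number, so `if var7 < 8` quietly excludes them, while R's logical indexing returns an NA entry for every row where the condition is NA. A minimal sketch of the difference, assuming a small made-up data frame (toy is hypothetical):

```r
# Hypothetical toy data with one missing value in var5
toy <- data.frame(id = c("a1", "b2", "c3", "d4"),
                  var5 = c(5, 12, NA, 7),
                  stringsAsFactors = FALSE)

# Logical indexing keeps an NA placeholder for the row where var5 is NA
toy$id[toy$var5 < 8]

# which() returns only the positions where the condition is TRUE
ids <- toy$id[which(toy$var5 < 8)]
ids
```

subset() also drops rows where the condition is NA, which is one reason it feels closer to Stata's if.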

3. Tabulate by variable

The last two are a bit trickier.

bysort var3: tab1 var2 var9 if var8==2 | var1 !=11 /* create a series of separate 
                         two-way frequency tables for those observations of var2
                         and var9 where var8 is 2 or var1 isn't 11 */

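We can do this by first subsetting the data we want, and then using by to tabulate var2 and var3 by var1 (the variable names and conditions are adapted to the example data, so they don't match the Stata line exactly).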

foo <- df[df$var4==20 | df$var5==7, ]
by(foo, foo$var1, function(x) table(x[, c("var2", "var3")]))

The function(x) part is called an anonymous function I think, and is common when you use functions like by, apply, etc. The call to by will break foo into pieces by var1, and then pass it on as the argument for our anonymous function, i.e. x. What gets passed on is a subset of foo, thus a data frame containing the original variable names, which is why we can subset x the same way we would foo.
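
If by feels opaque, the same split-then-apply idea can be spelled out in two steps with split and lapply. This is only a sketch of the idea on a small made-up data frame (toy is hypothetical), not what by does internally:

```r
# Hypothetical toy data (not the foo defined above)
toy <- data.frame(var1 = c(1, 1, 2, 2, 2),
                  var2 = c("a", "b", "a", "a", "b"),
                  var3 = c("x", "x", "y", "x", "y"))

# Step 1: break the data frame into a list of data frames, one per var1 value
pieces <- split(toy, toy$var1)

# Step 2: apply the same anonymous function to each piece;
# x is one of the smaller data frames, with the original column names
tabs <- lapply(pieces, function(x) table(x[, c("var2", "var3")]))

# The table for the observations where var1 is 2
tabs[["2"]]
```

The result is a plain list named by the var1 values, which can be easier to index than the object by returns.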

Technically you can also just add all three to the table call but that doesn't work well with so many variable values:

table(foo$var2, foo$var3, foo$var1)

4. Non-missing observations (?)

The last question is a bit strange. Wouldn't the count of var1 by var2 just be the frequency of values in var2 unless there were missing values? I'll assume there are missing values then.

egen var3 = count(var1), by(var2)  /* creates var3 as the total observations in 
                                      var1, for each category in var2 */

So here we break df into partitions by df$var2 and then apply a function that will count non-missing values in var3. The last bit changes it to a data frame with the var2 values and non-missing var3 counts.

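# Count the non-missing var3 values within each level of var2
v3obs <- by(df, df$var2, function(x) sum(!is.na(x$var3)))
v3obs

# Convert to a data frame: var2 values (as integers) and their counts (var6)
v3obs <- data.frame(var2=as.integer(names(v3obs)), var6=as.vector(v3obs))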

We can now merge the result back to our data frame to replicate what egen does.

# merge() has no type argument; all.x=TRUE keeps every row of df (a left join)
df <- merge(df, v3obs, by="var2", all.x=TRUE)

You could also do this with a for loop where you loop through rows, subset var3 for the value of var2 and fill in the count of non-missing observations. This might be easier to read but less efficient. There are probably also fancier ways of doing this that I'm not aware of, and by is not really that intuitive to me (I also came from a Stata background) so I generally try to avoid it.
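
One base R alternative that may feel closer to egen is ave(), which applies a function within groups and returns a vector aligned with the original rows, so no merge is needed. A minimal sketch, assuming a small made-up data frame (toy is hypothetical) and using the question's naming, i.e. counting non-missing var1 within var2:

```r
# Hypothetical toy data: var2 is the grouping variable, var1 has one NA
toy <- data.frame(var2 = c("A", "A", "B", "B", "B"),
                  var1 = c(10, NA, 3, 4, 5))

# Turn "is var1 observed?" into 0/1, then sum it within each var2 group;
# ave() repeats the group total on every row of that group
toy$var3 <- ave(as.integer(!is.na(toy$var1)), toy$var2, FUN = sum)
toy
```

Unlike merge, this keeps the rows in their original order.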

Thanks. This is great. It showed me some nuances in R's language that I hadn't fully picked up on yet and clarified some other things nicely (like function(x)--I've seen it before but thought it was a generic example). Sorry for the ambiguity on some of my questions. For instance, in the [egen var3 = count(var1), by(var2)] code, var2 is a categorical variable, so, this code seeks to count the frequency of var1 (say, 'trades' among NFL teams), counted separately by each category of var2 (say, 32 different NFL teams). — coip, Dec 19 '14 at 18:38
Glad it's helpful. I made the switch from Stata to R a few years ago, and it's a very different way of thinking. — andybega, Dec 20 '14 at 11:56
