I tried to answer your questions at the end. First, an example data frame to play around with:
set.seed(123)
df <- data.frame(id=c(paste0(letters[1:10], 1:10)), matrix(sample(1:20, 500, replace=T), nrow=100, ncol=5))
colnames(df)[2:6] <- paste0("var", 1:5)
1. Count values of a variable
For the first question, I'm not sure why you wouldn't do this with table(var2)
, but if you want, there are a couple of ways to do it.
count if var2==3 /* counts the number of observations that
have a value of 3 on var2 */
With the first one I tried to replicated what Stata does when you ask it to count. Here we subset the data frame for var2==3
, then count the number of rows.
nrow(df[df$var2==3, ])
You can do this more directly by taking the vector df$var2==3
, which is a logical TRUE/FALSE
vector with the same length as nrow(df)
, and summing the values, which will implicitly convert the vector from logical to 0/1
sum(df$var2==3)
2. List values meeting a condition
The second question also basically comes to down to subsetting, and in general I think what you would use if
in Stata for comes down to subsetting a R data frame with the same logical conditions.
list id if var7 < 8 /* lists the ID of observations with a
value less than 8 on var7 */
So here we subset the data frame by restricting rows to those that meet the condition var5 < 8
and selecting the variable, id
, that we want.
df$id[df$var5 < 8]
# or
df[df$var5 < 8, "id"]
# or
subset(df, var5 < 8, select="id")
# or
with(df, id[var5 < 8])
People usually don't recommend subset
. The second way is useful if you want to select variables whose names are contained in another object, e.g.
want <- c("id", "var1")
df[df$var5 < 8, want]
3. Tabulate by variable
The last two are a bit trickier.
bysort var3: tab1 var2 var9 if var8=2 | var1 !=11 /* create a series of separate
two-way frequency tables for those observations of var2
and var9 where var8 is 2 or var1 isn't 11 */
We can do this by first subsetting the data we want, and then using by
to tabulate var2
and var3
by var1
.
foo <- df[df$var4==20 | df$var5==7, ]
by(foo, foo$var1, function(x) table(x[, c("var2", "var3")]))
The function(x)
part is called an anonymous function I think, and is common when you use functions like by
, apply
, etc. The call to by
will break foo
into pieces by var1
, and then pass it on as the argument for our anonymous function, i.e. x
. What gets passed on is a subset of foo
, thus a data frame containing the original variable names, which is why we can subset x
the same way we would foo
.
Technically you can also just add all three to the table
call but that doesn't work well with so many variable values:
table(foo$var2, foo$var3, foo$var1)
4. Non-missing observations (?)
The last question is a bit strange. Wouldn't the count of var1
by var2
just be the frequency of values in var2
unless there were missing values? I'll assume there are missing values then.
egen var3 = count(var1), by(var2) /* creates var3 as the total observations in
var1, for each category in var2 */
So here we break df
into partitions by df$var2
and then apply a function that will count non-missing values in var3
. The last bit changes it to a data frame with the var2
values and non-missing var3
counts.
v3obs <- by(df, df$var2, function(x) sum(!is.na(x$var3)))
v3obs[]
v3obs <- data.frame(var2=names(v3obs[]), var6=v3obs[])
We can now merge the result back to our data frame to replicate what egen
does.
foo <- merge(foo, v3obs, by="var2", type="left")
You could also do this with a for
loop where you loop through rows, subset var3
for the value of var2
and fill in the count non-missing observations. This might be easier to read but less efficient. There are probably also fancier ways of doing this that I'm not aware of, and by
is not really that intuitive to me (I also came from a Stata background) so I generally try to avoid it.