
I have to work on a large database, with many variables to generate or modify and a lot of code to write. I'm used to the Stata environment, where everything you do is "inside" the database.

I'd like to break free from the "database$variable" syntax and be able to use a simple "variable" syntax.

An example of what I want to do: I have my database with the "age" variable and I want to recode it.

agecat1<-(floor(age) )    
agecat1[ agecat1<10] <- NA
agecat1[ agecat1>=17] <- NA
describe(agecat1)

Of course, this code sample does not find the "age" variable.

To make it work, I can either attach my database before running it (which works well for the first part), or write it as follows (but this is exactly what I want to avoid):

agecat1<-(floor(db$age) )    
agecat1[ agecat1<10] <- NA
agecat1[ agecat1>=17] <- NA
describe(agecat1)

And this is where I reach the limit of "attach()": my new variable "agecat1" is NOT in my database. It is now an independent object that won't be affected by what I may do with my database (removing rows with NA, for example).

So if I want my variable to be included in my DB, I need to write:

db$agecat1<-(floor(db$age) )    
db$agecat1[ db$agecat1<10] <- NA
db$agecat1[ db$agecat1>=17] <- NA
describe(db$agecat1)

And I'm back to square one: even with "attach()", I still have to use this painful db$variable syntax.

I read a post about attach(): Peter Ellis suggests attach() as a good way to reproduce a "Stata-like" environment, but Brian Diggs explains my problem very well. The alternatives offered (with() and data=) are only one-off fixes that need to be repeated for each function call (if I understood correctly), and are thus even more tedious than what I want to avoid.

Is there any way to work "inside" my database?
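
For illustration, here is a minimal sketch of one base R way to run the recode "inside" the data frame, using `within()`. The toy `db` below is a hypothetical stand-in for the real database, and whether this fits a larger workflow is an assumption rather than a recommendation.

```r
# Hypothetical toy data standing in for the real database
db <- data.frame(id  = 1:6,
                 age = c(8.5, 10.2, 12.9, 16.99, 17.0, 21.3))

# Inside { }, the columns of db are visible by their bare names.
# Variables created in the block are written back as columns of the result,
# so the result has to be assigned back to db to keep them.
db <- within(db, {
  agecat1 <- floor(age)
  agecat1[agecat1 < 10]  <- NA
  agecat1[agecat1 >= 17] <- NA
})

db$agecat1
```

Unlike with `attach()`, `agecat1` ends up as a column of `db`, so later row operations on `db` (dropping rows with NA, for example) carry it along.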

  • Maybe check out `within`, you can use `{ ... }` to perform many operations. I highly recommend staying away from `attach`. Typing a couple of extra characters is worth it to avoid the pain caused by this function. Also, note that there is tab completion, so you can type `db$a` and fill stuff in. – lmo Dec 12 '17 at 13:02
  • Also, `describe` is not a base R function. If you are using some of the Hadley tools, there are often methods to avoid typing extra stuff, like with the `magrittr` pipes `%>%`. Or there's the `data.table` package, which has a bunch of stuff that has a similar flavor to Stata, like `db[, .(meanVar=mean(agecat)), by=myGroup]` which is the analog of `collapse` or `db[, meanVar:=mean(agecat), by=myGroup]` which is the analog of `by: (e)gen` commands. – lmo Dec 12 '17 at 13:04
  • Also if you find yourself doing the same things repeatedly consider writing a function to accomplish the task. It can save some typing and make the code easier to read and more consistent. For example if you find yourself adding in NAs outside of certain limits (for whatever reason) a lot you could write a simple function `na_outside <- function(x, lower, upper){x[x <= lower] <- NA;x[x>=upper]<-NA;x}` and then use that instead of doing it all 'by hand'. – Dason Dec 12 '17 at 13:07
  • On a side note, the way you named your variable makes it sound like you're trying to make your own variables to break a different variable into categories. There are better ways to do that if that's what you're trying to achieve. – Dason Dec 12 '17 at 13:08
  • @Dason Thx for the help, I get that there is no easy and foolproof way to avoid the db$, I'll have to get used to it. As for the suggestions, the variables I work with are too different to allow me to use functions (and the NA exclusion is a one time action). – gabriel fernandez Dec 19 '17 at 10:41
  • @Dason I'm very interested in any advice you may have about breaking variables into categories. The age was done using "floor", but I have other variables to manipulate, such as scores to break up (or not) and values to regroup (because there are too many of them). – gabriel fernandez Dec 19 '17 at 10:47
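
The thread does not say which "better ways" of making categories were meant. As a hedged sketch, base R's `cut()` is one common option for banding numeric variables, and a named lookup vector is one simple way to regroup values. The data, the break points, and the names `score`, `region`, `scorecat`, and `region2` are all hypothetical.

```r
# Hypothetical toy data; only age appears in the question
db <- data.frame(age    = c(8.5, 10.2, 12.9, 16.99, 17.0, 21.3),
                 score  = c(3, 12, 25, 40, 55, 70),
                 region = c("N", "NE", "S", "SW", "N", "S"))

# One-year age bands [10,11), [11,12), ..., [16,17).
# Values outside the breaks (below 10, or 17 and above) become NA.
db$agecat <- cut(db$age, breaks = 10:17, right = FALSE)

# Score bands with explicit labels (the cut-offs are assumptions)
db$scorecat <- cut(db$score, breaks = c(0, 20, 50, Inf),
                   labels = c("low", "mid", "high"), right = FALSE)

# Regroup many values into fewer using a named lookup vector
lookup <- c(N = "north", NE = "north", S = "south", SW = "south")
db$region2 <- unname(lookup[db$region])

db
```

With `right = FALSE` the intervals are closed on the left, which matches the `floor()` recode in the question: an age of 16.99 falls in `[16,17)`, while 17.0 is excluded.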
