How to find a common variable in a large number of databases using Stata

Question

So I have a large number of databases (82) in Stata, that each contain around 1300 variables and several thousand observations. Some of these databases contain variables that give the mean or standard deviation of certain concepts. For example, a variable in such a dataset could be called "leverage_mean". Now, I want to know which datasets contain variables called concept_mean or concept_sd, without having to go through every dataset by hand.

I was thinking that maybe there is a way to loop through the databases looking for variables containing "mean" or "sd", unfortunately I have idea how to do this. I'm using R and Stata datafiles.

Welcome to StackOverflow. Please take a look at these tips on how to produce a [minimum, complete, and verifiable example](http://stackoverflow.com/help/mcve), as well as this post on [creating a great example in R](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Perhaps the following tips on [asking a good question](http://stackoverflow.com/help/how-to-ask) may also be worth a read. — lmo, Jun 29 '16 at 13:22

score 1 · Accepted Answer · answered Jun 29 '16 at 14:15

Yes, you can do this with a loop in stata as well as R. First, you should check out the stata command ds and the package findname, which will do many of the things described here and much more. But to show you what is happening "under the hood", I'll show the Stata code that can achieve this below:

/*Set your current directory to the location of your databases*/
cd "[your cd here]"

Save the names of the 82 databases to a list called "filelist" using stata's dir function for macros. NOTE: you don't specify what kind of file your database files are, so I'm assuming .xls. This command saves all files with extension ".xls" into the list. What type of file you save into the list and how you import your database will depend on what type of files you are reading in.

local filelist : dir . files "*.xls"

Then loop over all files to show which ones contain variables that end with "_sd" or "_mean".

foreach file of local filelist {
    /*import the data*/
    import excel "`file'", firstrow clear case(lower)

    /*produce a list of the variables that end with "_sd" and "_mean"*/

    cap quietly describe *_sd *_mean, varlist
    if length("r(varlist)") > 0 {

    /*If the database contains variables of interest, display the database file name and variables on screen*/

        display "Database `file' contains variables: " r(varlist)
    }
}

Final note, this loop will only display the database name and variables of interest contained within it. If you want to perform actions on the data, or do anything else, those actions need to be included in the position of the final "display" command (which you may or may not ultimately actually need).

I see, would it be the same idea in R? Could you show me? Sorry, I'm new to this. — Oscar, Jun 29 '16 at 17:05
So in R I tried this: (f <- file.path("C:\\User\\Desktop\\DATA")) d <- lapply(f, read.dta) So for each file I loop and read it into R using the function read.dta. Now could I somehow do something like this? e <- lapply(d, findname if containing _mean,....) or would it then loop through the file names instead? — Oscar, Jun 29 '16 at 21:03
I would try building the list of files using the function `list.files()` then `apply()` on that list. I'm less proficient with R and I haven't tested that code so let me know how it goes... — enRANDOMSTRING, Jun 30 '16 at 01:41
The the function `names()` will give you the list of variables contained in the data. If you actually prefer an answer in R, you might want to consider editing your question slightly. — enRANDOMSTRING, Jun 30 '16 at 01:49
So I got this now: temp = list.files(pattern="*.dta") then I do for (i in 1:length(temp)) assign(temp[i], read.dta13(temp[i])) and use grep(pattern="_m", temp, value=TRUE), but this gves me filenames that include _m, however I want the variable names that end with _m and the dataset that they're in — Oscar, Jun 30 '16 at 11:07

score 1 · Answer 2 · answered Jun 29 '16 at 15:11

You can use filelist, (from SSC) to create a dataset of files. To install filelist, type in Stata's Command window:

ssc install filelist

With a list of datasets in memory, you can then loop over each file and use describe to get a list of variables for each file. You can store this list of variables in a single string variable. For example, the following will collect the names of all Stata datasets shipped with Stata and then store for each the variables they contain:

findfile "auto.dta"
local base_dir = subinstr("`r(fn)'", "/a/auto.dta", "", 1)

dis "`base_dir'"
filelist, dir("`base_dir'") pattern("*.dta")

gen variables = ""
local nmatch = _N
qui forvalues i = 1/`nmatch' {
    local f = dirname[`i'] + "/" + filename[`i']
    describe using "`f'", varlist
    replace variables = " `r(varlist)' " in `i'
}
leftalign  // also from SSC, to install: ssc install leftalign

Once you have all this information in the data in memory, you can easily search for specific variables. For example:

. list filename if strpos(variables, " rep78 ")

     +-----------+
     | filename  |
     |-----------|
 13. | auto.dta  |
 14. | auto2.dta |
     +-----------+

score 1 · Answer 3 · answered Feb 10 '22 at 19:00

1

The lookfor_all package (SSC) is there for that purpose:

cd "pathtodirectory"

lookfor_all leverage_mean

Just make sure the file extensions are in lowercase(.dta) and not upper.

answered Feb 10 '22 at 19:00

Nile

89
1
5

How to find a common variable in a large number of databases using Stata

3 Answers3