How to find common variables in a list of datasets & reshape them in R?

Question

    setwd("C:\\Users\\DATA")
    temp = list.files(pattern="*.dta")
    for (i in 1:length(temp)) assign(temp[i], read.dta13(temp[i], nonint.factors = TRUE))
    grep(pattern="_m", temp, value=TRUE)

Here I create a list of my datasets and read them into R. I then attempt to use grep to find all variable names containing the pattern _m, but this obviously doesn't work, because it simply returns all file names matching _m. What I want is for my code to loop through the list of datasets, find the variables ending in _m, and return a list of the datasets that contain these variables.

I'm quite unsure how to do this, as I'm quite new to coding and R.

Apart from needing to know which datasets these variables are in, I also need to be able to make changes to these variables (reshape them).

lmo · Answer 1 · 2016-06-30T16:14:13.580

0

Here is one way to figure out which files have variables with names ending in "_m":

# setup
library(readstata13)  # provides read.dta13
setwd("C:\\Users\\DATA")
# list.files() takes a regular expression, so match names ending in ".dta"
temp <- list.files(pattern = "\\.dta$")
# logical vector to be filled in
inFileVec <- logical(length(temp))

# loop through each file (seq_along is safe when temp is empty)
for (i in seq_along(temp)) {
  # read file
  fileTemp <- read.dta13(temp[i], nonint.factors = TRUE)

  # fill in vector with TRUE if any variable ends in "_m"
  inFileVec[i] <- any(grepl("_m$", names(fileTemp)))
}

In the final line of the loop, names returns the variable names, grepl returns a logical vector indicating whether each variable name matches the pattern, and any returns a single logical value indicating whether at least one TRUE was returned from grepl.

# print out these file names    
temp[inFileVec]
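
The check inside the loop can be tried without any .dta files. The following is a minimal sketch, assuming two small made-up data frames (`toyA` and `toyB` are hypothetical names, not from the question) standing in for datasets read with read.dta13:

```r
# Hypothetical stand-ins for two datasets read from disk
toyA <- data.frame(id = 1:2, income_m = c(10, 20))
toyB <- data.frame(id = 1:2, m_score = c(3, 4))

# TRUE for each column whose name ends in "_m"
endsInM_A <- grepl("_m$", names(toyA))
endsInM_B <- grepl("_m$", names(toyB))

# collapse each logical vector to a single TRUE/FALSE per dataset
hasM <- c(toyA = any(endsInM_A), toyB = any(endsInM_B))
print(hasM)
```

Only toyA is flagged here, because none of the column names in toyB end in "_m".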

edited Jun 30 '16 at 16:14

answered Jun 30 '16 at 16:06

lmo


Thank you for answering! The last line 'temp[inFileVec]' just says 'character(0)' as output, what does that mean? – Oscar Jun 30 '16 at 16:26
Try printing out the names of the columns in the loop: `cat(names(fileTemp))`. – lmo Jun 30 '16 at 16:51
Hmm, it gives a bunch of variable names, but strangely none of them match the "_m" pattern. – Oscar Jun 30 '16 at 18:08
So none of the variable names, perhaps translated when read in by `read.dta13`, use that pattern. Are there ".m" patterns at the end? If so, use "\\.m$" in `grepl` instead. – lmo Jun 30 '16 at 18:13
I'm sure there are _m patterns, it just gives the same variables no matter what I give as a pattern. Also it seems like fileTemp is just one of the datasets, so maybe there is something going wrong while looping through each file? – Oscar Jun 30 '16 at 20:59
Try manually reading in the first ten files of temp and see what the variable names are. – lmo Jun 30 '16 at 23:44

score 0 · Accepted Answer · edited May 23 '17 at 11:44


First, assign may not work as you think. It expects a single string (a character vector of length one, in R terms) as the variable name; if you pass a longer character vector, it uses only the first element (see here for more info).
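
As a rough sketch of what assign does with file names, and of a named list as one possible alternative: the file names and the placeholder reader `readOne` below are hypothetical, standing in for `list.files` and `read.dta13` so that the snippet runs without any data files.

```r
# Hypothetical file names; in the question these come from list.files()
files <- c("wave1.dta", "wave2.dta")

# Placeholder for read.dta13(): returns a tiny data frame per "file"
readOne <- function(f) data.frame(id = 1:2, x_m = c(1, 2))

# assign() creates an object literally named "wave1.dta";
# such a name can only be reached with get() or backticks
assign(files[1], readOne(files[1]))
viaAssign <- get("wave1.dta")

# Alternative: keep every dataset in one named list
dataList <- lapply(files, readOne)
names(dataList) <- files
print(names(dataList))
print(names(dataList[["wave2.dta"]]))
```

With a list, looping over all datasets becomes a matter of `lapply` or `for` over `dataList`, rather than looking up separately named objects.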

What you can do depends on the structure of your data. read.dta13 will load each file as a data.frame.

If you are looking for column names, you can do something like this:

myList <- character()
for (i in seq_along(temp)) {

    # save the content of your file in a data frame
    df <- read.dta13(temp[i], nonint.factors = TRUE)

    # identify the names of the columns matching your pattern
    # (use "_m$" instead to match only names ending in "_m")
    varMatch <- grep(pattern = "_m", colnames(df), value = TRUE)

    # check if at least one of the columns matches the pattern
    if (length(varMatch) > 0) {
        myList <- c(myList, temp[i]) # save the file name if there is a match
    }

}

If you are looking at the content of a column instead, have a look at the dplyr package, which is very useful for manipulating data frames.

A good introduction to dplyr is available in the package vignette here.

Note that in R, appending to a vector can become very slow (see this SO question for more details).
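
One way to avoid growing a vector is to compute one logical value per dataset and subset once at the end. This is a minimal sketch, assuming the datasets are already held in a named list; `dataList` and its contents are made up for illustration.

```r
# Hypothetical datasets, keyed by file name
dataList <- list(
  "a.dta" = data.frame(id = 1:2, wage_m = c(5, 6)),
  "b.dta" = data.frame(id = 1:2, wage = c(5, 6)),
  "c.dta" = data.frame(id = 1:2, hours_m = c(7, 8))
)

# one TRUE/FALSE per dataset; vapply() builds the full result in one go
hasMatch <- vapply(
  dataList,
  function(d) any(grepl("_m", colnames(d))),
  logical(1)
)

# subset the file names once, instead of appending inside a loop
matchingFiles <- names(dataList)[hasMatch]
print(matchingFiles)
```

For a handful of files the difference is negligible, so the loop with `c()` is fine in practice; this form mainly matters when the number of files is large.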

edited May 23 '17 at 11:44

Community


answered Jun 30 '16 at 16:24

paulwasit


Thank you for answering! But how do I actually view the list that we saved the names to? Right now it seems to be in my Global Environment tab below data under 'Values'. – Oscar Jun 30 '16 at 18:07
You are right, the names are in the vector myList. You can access it as you wish, but typing myList at the prompt will display its content in the console. – paulwasit Jun 30 '16 at 21:48
Okay this works, now I got a list of datasets containing the variables that I need. Now, what if from that new list I want to extract all columns from every dataset that are named 'a' , 'b' or contain pattern "_m" and put them in a new dataset? – Oscar Jul 01 '16 at 11:04
I thought maybe I could do this, after saving the content of the files in a dataframe, inside the loop by typing: m <- select(df, a, b, (grep("_m", colnames(df), value=TRUE))). But I guess that doesn't work... – Oscar Jul 01 '16 at 12:10
You would do the same as before, but with a different regex using the OR operator: `grep('^(a|b|_m)*$', colnames(df), value=TRUE)` should work. It looks for repetitions (symbol *, which means zero or more occurrences) of either a, b or _m (the OR alternatives in the parentheses) between the beginning (symbol ^) and the end (symbol $) of each column name. – paulwasit Jul 01 '16 at 12:14
Sorry, I had misunderstood what you were trying to do. What you did does not work because you cannot directly pass a vector as column names. Try using `select_(df, .dots = c("a","b", grep("_m", colnames(df), value=TRUE)))`. See the dplyr "nse" vignette here: https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html – paulwasit Jul 01 '16 at 12:23
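
The same selection can also be sketched in base R, which avoids depending on a particular dplyr version (`select_()` has since been deprecated in dplyr). The toy data frames and the helper name `pickCols` are assumptions for illustration, not part of the thread. The sketch keeps one extracted data frame per dataset rather than guessing how they should be combined.

```r
# Hypothetical datasets, keyed by file name
dataList <- list(
  "x.dta" = data.frame(a = 1:2, b = 3:4, inc_m = 5:6, other = 7:8),
  "y.dta" = data.frame(a = 1:2, age_m = 9:10, junk = 11:12)
)

# Keep columns named "a" or "b" (if present) plus any name containing "_m"
pickCols <- function(d) {
  wanted <- c(intersect(c("a", "b"), colnames(d)),
              grep("_m", colnames(d), value = TRUE))
  d[, unique(wanted), drop = FALSE]
}

extracted <- lapply(dataList, pickCols)
print(lapply(extracted, colnames))
```

Using `intersect` means a dataset that lacks `a` or `b` does not raise an error, and `drop = FALSE` keeps the result a data frame even when only one column matches.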
