I have six large data sets, each of which starts off with around 250 predictors (the same initial predictors in every data set). An algorithm is then run that removes a predictor from a data set if it does not meet a certain criterion.

For example, there is a predictor called X.50T.

X.50T may be removed from the first and second data sets but kept in the others. The same is true for all the other predictors.

I want to know which predictors are contained in all six of my data sets.

How can this be done in R?

Each data set also has a corresponding outcome column ($d_{i}$).

I.e., for the six data sets I have six outcome columns: $d_{1}$, $d_{2}$, $d_{3}$, $d_{4}$, $d_{5}$ and $d_{6}$.

I want to make a new data frame that contains the above six columns AND the predictors, but only the predictors that appeared in ALL six of the data sets.

Each of the six data sets has between 1800 and 2000 rows, each identified by a row name. I also want to include only those rows whose observation appears in all six data sets. For example, the data frames have "row.names" 1, 2, 3, ..., 2000, with some missing in between. If the observation with row name "150" is present in all six data sets, I want to include it; if it is missing from even one, I want to exclude it.

For example, let's say that out of the 250 predictors, only 200 appear in all six data sets, and the number of observations is around 2000. Then I would want a 2000 by 206 matrix as my new data frame. But since I want to include only the rows that appear in all six, it may be a smaller data frame, say 1800 x 206.

Thanks

Quality
  • Please read [this](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so that it is easier to help you. – s__ Oct 12 '18 at 06:48

1 Answer

To get the column names of a data frame, first use names or colnames, as in:

cols <- colnames(df)

To get the intersection of column names, use intersect. For instance:

first <- c("Espresso", "Flat White", "Americano")
second <- c("Americano", "Espresso", "Tea")
intersect(first, second)
[1] "Espresso"  "Americano"

To do this for more than two vectors of column names, use Reduce, which applies intersect pairwise across the whole list:

third <- fourth <- fifth <- sixth <- first[-1]
third
[1] "Flat White" "Americano" 

final_columns <- Reduce(intersect, list(first, second, third, fourth, fifth, sixth))
final_columns 
[1] "Americano"

To add a few columns by hand, use c(), as in:

final_columns <- c("Bulletproof Coffee", final_columns)

Once this is done, just subset the original dataframe:

newdf <- original_df[, final_columns]
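
Applied to actual data frames rather than hand-written vectors, the steps above might look like the following minimal sketch. The data frames df1 to df3 and their values are made up for illustration (three stand-ins instead of six), and only the predictor names echo the question.

```r
# Toy stand-ins for the data sets; names and values are hypothetical.
df1 <- data.frame(X.10T = 1:3, X.90T = 4:6)
df2 <- data.frame(X.10T = 1:3, X.50T = 7:9, X.90T = 4:6)
df3 <- data.frame(X.50T = 7:9, X.90T = 4:6, X.10T = 1:3)

dfs <- list(df1, df2, df3)

# Column names present in every data frame, in the order of the first one.
common_cols <- Reduce(intersect, lapply(dfs, colnames))
common_cols

# Keep only the shared columns in each data frame; drop = FALSE keeps a
# data frame even if only a single column survives.
trimmed <- lapply(dfs, function(d) d[, common_cols, drop = FALSE])
```

With six data sets, the only change would be to put all six into the list.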

The same can be done for row names, using rownames together with intersect, though there are other ways to achieve the same result, e.g. inner joins or merges. In any case, the above should give you an idea of how to get the data frame you want.
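
As a rough sketch of the row-name variant, the example below intersects both row names and predictor names and then assembles one data frame holding every outcome column plus the shared predictors. Everything in it is hypothetical: the toy data frames, the outcome names d1 to d3 (three data sets instead of six), and in particular the assumption that a predictor has the same values in every data set, so that taking the predictors from the first one is enough.

```r
# Toy stand-ins for the data sets; all names and values are hypothetical.
# Each data set has its own outcome column (d1, d2, d3).
df1 <- data.frame(d1 = c(0, 1, 0, 1),
                  X.10T = c(1.5, 2.5, 3.5, 4.5),
                  X.50T = c(9, 8, 7, 6),
                  row.names = c("1", "2", "3", "5"))
df2 <- data.frame(d2 = c(1, 1, 0),
                  X.10T = c(2.5, 3.5, 4.5),
                  row.names = c("2", "3", "5"))
df3 <- data.frame(d3 = c(0, 0, 1),
                  X.10T = c(1.5, 2.5, 4.5),
                  X.50T = c(9, 8, 6),
                  row.names = c("1", "2", "5"))

dfs <- list(df1, df2, df3)
outcomes <- c("d1", "d2", "d3")  # one outcome column per data set

# Row names shared by every data set.
common_rows <- Reduce(intersect, lapply(dfs, rownames))

# Predictor names shared by every data set (outcome columns excluded).
common_preds <- Reduce(
  intersect,
  Map(function(d, y) setdiff(colnames(d), y), dfs, outcomes)
)

# One outcome column from each data set, restricted to the shared rows.
outcome_part <- do.call(
  cbind,
  Map(function(d, y) d[common_rows, y, drop = FALSE], dfs, outcomes)
)

# Assumption: predictor values agree across data sets, so the shared
# predictors are taken from the first data set only.
newdf <- cbind(outcome_part,
               dfs[[1]][common_rows, common_preds, drop = FALSE])
newdf
```

If the predictor values can differ between data sets, this shortcut does not apply and the predictors from each data set would have to be kept separately, for example with suffixed column names via merge.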

coffeinjunky