1

I have a data frame with a column of factors and columns with values for each factor plus additional factors that are no longer included in the data frame. Example:

x <- data.frame(f= toupper(sample(letters[1:3], 5, replace=T)), 
            x.A = seq(1:5),
            x.B = seq(1:5),
            x.C = seq(1:5),
            x.D = seq(1:5),
            x.E = seq(1:5))

Resulting in:

  f x.A x.B x.C x.D x.E
1 B   1   1   1   1   1
2 B   2   2   2   2   2
3 A   3   3   3   3   3
4 C   4   4   4   4   4
5 A   5   5   5   5   5

Now I want to remove all columns that do not represent a current level in column f, resulting in a data frame:

  f x.A x.B x.C
1 B   1   1   1
2 B   2   2   2
3 A   3   3   3
4 C   4   4   4
5 A   5   5   5

Naming convention is consistent among levels and column names, and names always take the form somevariable.FACTORLEVEL. I would type all the names in a list to choose from, but it gets long and unwieldy. I tried using grep as follows:

subX <- x[x$f == 'B', grep('B', names(x))]

But don't quite get what I want and don't know how to extend that over all levels if it did work. I also looked at previous questions here and here, but they don't go as far as I need. Any help would be appreciated. Thanks.

Community
  • 1
  • 1
tjr
  • 607
  • 2
  • 7
  • 20

2 Answers2

2

We use sub to remove the prefix x. from the column names of 'x', check whether it is %in% the 'f' column to create a logical vector and use this to subset the columns of 'x'. We removed the first column name (as it is 'f') and later concatenated with TRUE to include that column also in the subset.

 x[c(TRUE,sub('.*\\.', '', names(x)[-1]) %in% x$f)]

Or we can use grepl with pattern specified by pasteing the 'f' column to return a logical index as before.

x[c(TRUE,grepl(paste(x$f, collapse='|'), names(x)[-1]))]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks! Is it possible to only do that over a range of columns? So if there were columns I wanted to keep no matter what and only remove from a set. So in this case x[,2:6] – tjr Nov 23 '15 at 18:42
  • @tjr Suppose you want to say check the columns 3:5 and need columns 1:2 and 6 along with the columns returned by the comparison, try `x[c(rep(TRUE,2), sub('.*\\.', '', names(x)[3:5]) %in% x$f, TRUE)]` – akrun Nov 23 '15 at 18:48
  • 1
    Thanks. Sorry I should have tried before asking the question. I found that works well. Also, I realized my question may be confusing using x as the dataframe and name, so future readers watch for that. – tjr Nov 23 '15 at 18:52
2

This would also work.

x[c(T, (gsub("x.", "", names(x)) %in% x$f)[-1])]
akrun
  • 874,273
  • 37
  • 540
  • 662
entropium
  • 29
  • 4