Selecting columns based on pattern

Question

Possible duplicate: Extracting specific columns from a data frame

I have a data frame whose columns follow a repeating pattern. Here I show 10 columns, but in the final data frame the number of columns is not known in advance, as it depends on the data given.

V1     V2          V3         V4 V5          V6         V7 V8          V9        V10
ADAM32  P 0.001000000   40.61038  P 0.001000000   40.61038  P 0.001000000   40.61038
CCL5    P 0.000491000 6546.20000  P 0.000491000 6546.20000  P 0.000491000 6546.20000
CILP2   A 0.500000024   92.66398  A 0.500000024   92.66398  A 0.500000024   92.66398
EPHB3   P 0.000562000  461.30000  P 0.000562000  461.30000  P 0.000562000  461.30000
GUCA1A  P 0.002006000    9.40000  P 0.002006000    9.40000  P 0.002006000    9.40000
HSPA6   P 0.000322000  564.00000  P 0.000322000  564.00000  P 0.000322000  564.00000
MAPK1   P 0.002000000  435.00000  P 0.002000000  435.00000  P 0.002000000  435.00000
PIGX    P 0.003822926  411.38856  P 0.003822926  411.38856  P 0.003822926  411.38856
PTPN21  M 0.051040220   94.30000  M 0.051040220   94.30000  M 0.051040220   94.30000
THRA    M 0.054470000  151.10000  M 0.054470000  151.10000  M 0.054470000  151.10000
UBA7    P 0.000468000  845.60000  P 0.000468000  845.60000  P 0.000468000  845.60000
WFDC2   P 0.005475547  177.61689  P 0.005475547  177.61689  P 0.005475547  177.61689
7-Mar   P 0.000673000  643.20000  P 0.000673000  643.20000  P 0.000673000  643.20000

From the above data frame I want the first two columns and then every third column after that (skipping two columns each time). That is, I want V1, V2, V5, V8 and so on until the data frame is exhausted. If I have a data frame of 1000 columns following the same pattern, how can I select these columns?

The expected output:

     V1 V2  V5  V8
 ADAM32  P   P  P
   CCL5  P   P  P 
  CILP2  A   A  A
  EPHB3  P   P  P
 GUCA1A  P   P  P
  HSPA6  P   P  P
  MAPK1  P   P  P
   PIGX  P   P  P
 PTPN21  M   M  M
   THRA  M   M  M
   UBA7  P   P  P
  WFDC2  P   P  P
  7-Mar  P   P  P

Colonel Beauvel · Accepted Answer · 2015-04-04T16:53:37.547

5

If the criterion is to select only the columns that are not numeric, you can use Filter:

Filter(Negate(is.numeric), df)

Example on dummy data:

df = data.frame('a','b',1,2,'c',23,45.0,'c')
Filter(function(u) !is.numeric(u), df)
#  X.a. X.b. X.c. X.c..1
#1    a    b    c      c
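
Selecting by type only works if the numeric columns are actually stored as numeric. As a rough sketch of that caveat, the toy data frame `raw` below is an assumption (it is not the asker's data) in which the numbers are deliberately stored as character strings. `type.convert()` is used as one possible way to restore the types before filtering.

```r
# Hypothetical toy data: numeric values stored as character strings
raw <- data.frame(
  V1 = c("ADAM32", "CCL5"),
  V2 = c("P", "A"),
  V3 = c("0.001", "0.5"),
  V4 = c("40.6", "92.7"),
  stringsAsFactors = FALSE
)

# Every column is character, so nothing is filtered out
names(Filter(Negate(is.numeric), raw))
# "V1" "V2" "V3" "V4"

# Convert each column to its natural type, keeping text as character
fixed <- raw
fixed[] <- lapply(raw, type.convert, as.is = TRUE)

# Now only the truly non-numeric columns remain
names(Filter(Negate(is.numeric), fixed))
# "V1" "V2"
```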

To select the first column, the second, the fifth, the eighth and so on, you can also try:

df[,c(1,(1:ceiling(length(df)/3))*3-1)]
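
To see what the index arithmetic in the one-liner above produces, the sketch below builds a hypothetical 10-column data frame (`toy`, standing in for the V1..V10 layout) and prints the generated indices. One thing worth checking on your own data: when the column count leaves remainder 1 after division by 3 (as with 10 columns), `ceiling(length(df)/3)` generates one index past the last column, and subsetting with it fails with "undefined columns selected". Bounding the indices by `ncol()` is one cautious variant.

```r
# Hypothetical 10-column frame mimicking the V1..V10 layout
toy <- as.data.frame(matrix(seq_len(20), nrow = 2, ncol = 10))

# Indices generated by the formula: column 1, then 3*k - 1 for k = 1, 2, ...
idx <- c(1, (1:ceiling(length(toy) / 3)) * 3 - 1)
idx
# 1 2 5 8 11

# Drop any index beyond the last column before subsetting
safe_idx <- idx[idx <= ncol(toy)]
names(toy[, safe_idx])
# "V1" "V2" "V5" "V8"

# The bounded indices match the ones written with seq()
all(safe_idx == c(1, 2, seq(5, ncol(toy), 3)))
# TRUE
```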

edited Apr 04 '15 at 16:53

answered Apr 04 '15 at 16:42

Colonel Beauvel


By the way, nice thought. I think it's quite pretty. – Agaz Wani Apr 04 '15 at 16:45
It's an alternative proposal, but elegant only if you're sure your figures are actually numeric and not formatted as character ;) – Colonel Beauvel Apr 04 '15 at 16:53
Thanks akrun, this is really awesome this way! – Colonel Beauvel Apr 04 '15 at 16:53
My data frame was in character format, then I changed it to numeric, so it works fine. Thanks a lot. – Agaz Wani Apr 04 '15 at 16:59
Yes, this implicit check is needed. It's a different criterion from selecting the sequence, but quite neat! Glad to help! – Colonel Beauvel Apr 04 '15 at 17:00
You mean `Negate(is.numeric)`? It speaks for itself: you check whether the element is numeric and negate that result. So the function returns TRUE on a non-numeric element: `Negate(is.numeric)('a')` – Colonel Beauvel Apr 04 '15 at 17:12
I mean the last line of code that you added: `df[,c(1,(1:ceiling(length(df)/3))*3-1)]` – Agaz Wani Apr 04 '15 at 17:14
Ah, you want column 1 and then `2, 5, 8, 11 .... = 3-1, 6-1, 9-1, 12-1 ... = 3*(1 ,2, 3, 4...) - 1`. But it's preferable to use `c(1,2,seq(5,ncol(df),3))`, which is more readable, as @LyzandeR proposed. Being a mathematician, I tend to write down formulas ;) – Colonel Beauvel Apr 04 '15 at 17:19

score 2 · Answer 2 · answered Apr 04 '15 at 16:46


The seq function can help with this in the following way:

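# Read the sample data copied from the question
df <- read.table('clipboard', header = TRUE)

# Columns 1 and 2, then every third column starting at column 5
df[, c(1, 2, seq(5, ncol(df), 3))]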

       V1 V2 V5 V8
1  ADAM32  P  P  P
2    CCL5  P  P  P
3   CILP2  A  A  A
4   EPHB3  P  P  P
5  GUCA1A  P  P  P
6   HSPA6  P  P  P
7   MAPK1  P  P  P
8    PIGX  P  P  P
9  PTPN21  M  M  M
10   THRA  M  M  M
11   UBA7  P  P  P
12  WFDC2  P  P  P
13  7-Mar  P  P  P

Essentially, seq creates the sequence you want: it starts at 5, runs up to the total number of columns, and returns every third column index (that is, it skips two columns each time). I then just added the first and second columns, as you wanted.
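
For a data frame of arbitrary width, the same idea can be wrapped in a small function. The sketch below is only an illustration: the name `pick_pattern_cols` and its parameters are hypothetical, and the guard is there because `seq(5, n, 3)` raises an error when there are fewer than 5 columns.

```r
# Hypothetical helper: keep the first `keep` columns, then every `step`-th
# column starting at position `start`
pick_pattern_cols <- function(df, keep = 2, start = 5, step = 3) {
  n <- ncol(df)
  head_idx <- seq_len(min(keep, n))
  # seq() raises an error when start > n, so guard that case explicitly
  tail_idx <- if (n >= start) seq(start, n, by = step) else integer(0)
  df[, c(head_idx, tail_idx), drop = FALSE]
}

# A 1000-column frame, as in the question's hypothetical
wide <- as.data.frame(matrix(0, nrow = 1, ncol = 1000))
picked <- pick_pattern_cols(wide)

head(names(picked), 5)
# "V1" "V2" "V5" "V8" "V11"
ncol(picked)
# 334
```

Using `drop = FALSE` keeps the result a data frame even if only one column is selected.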

answered Apr 04 '15 at 16:46

LyzandeR


Thanks for your comments. Your code gives the expected result but throws the following warning: `Warning in run(timeoutMs) : incomplete final line found by readTableHeader on 'clipboard'` – Agaz Wani Apr 04 '15 at 17:04
I copied your example data (which saved it in the clipboard) and then used `read.table` with the first argument being the 'clipboard'; i.e. I essentially read your sample table. Make sure you do the same otherwise I don't know what will be stored in the clipboard :) – LyzandeR Apr 04 '15 at 17:07
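
If relying on the clipboard is inconvenient, one alternative is to paste the sample as a string and pass it through the `text` argument of `read.table`. The sketch below uses only the first three rows of the question's data, with trailing zeros trimmed; the variable name `txt` is an assumption. The "incomplete final line" warning usually just means the input did not end with a newline and is generally harmless, though that is worth confirming on your own setup.

```r
# First three rows of the question's data, pasted as a string
txt <- "V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
ADAM32 P 0.001 40.61038 P 0.001 40.61038 P 0.001 40.61038
CCL5 P 0.000491 6546.2 P 0.000491 6546.2 P 0.000491 6546.2
CILP2 A 0.500000024 92.66398 A 0.500000024 92.66398 A 0.500000024 92.66398
"

# header = TRUE takes the first line as column names
df <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

# Columns 1 and 2, then every third column starting at column 5
res <- df[, c(1, 2, seq(5, ncol(df), 3))]
names(res)
# "V1" "V2" "V5" "V8"
```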
