Pick out top 50% of data in every column

Question

Let's say this is my matrix `marx`, with nrow = 400 and ncol = 250. I want to pick only half of the data (the top 50%) from every column, excluding NA.

          V272      V273       V274      V275       V276      V277
[1,] 0.2337847 0.2612946 0.41232797        NA 0.11931570 0.2543780
[2,] 0.3277191 0.3590431 0.06490879 0.2690663         NA 0.1632647
[3,]        NA 0.1536955 0.03604548 0.1361645         NA 0.2252554
[4,] 0.3483152 0.5342417 0.07404933        NA 0.14699876 0.2082977
[5,] 0.4213399 0.2511010 0.30502173 0.1189562 0.08962128 0.2919712
[6,] 0.1604953 0.2101048         NA        NA 0.01270747 0.2322928

I have tried with sample = length(x)/2 and a loop, but that still does not work. Does anyone have any thoughts?

Do you want to sample from each column independently and randomly? Or just take the first half of your matrix? Or sort the data in each column and pick everything above the median? And what class is your data structure? You say "matrix" but you call it `df`, which implies a `data.frame`... — Gregor Thomas, Apr 28 '16 at 22:21
If you want to keep the data intact (row-wise), one option is to use `complete.cases()` and then take however many of the remaining rows you want: `df <- df[complete.cases(df), ]; df <- df[sample(1:nrow(df), n), ]`, where n is the number of rows you want. — Gopala, Apr 28 '16 at 22:23
I want to sort the data into descending order, then pick half of those numbers (excluding NA) out of each column. Oh sorry, my data is in matrix form. — beboo23, Apr 28 '16 at 22:29
Sorting into descending order by which column? Or, as @Gregor asked above, do you want to do it column by column and not care about the relationship between them? It is not at all clear what you want. — Gopala, Apr 28 '16 at 22:34
I mentioned it above. Sort every column into descending order, and yes, I ignore the relationship between them. Hope that's clear. — beboo23, Apr 28 '16 at 22:50

Gregor Thomas · Answer 1 · 2016-04-29T15:00:33.093

I would do it like this:

apply(x, 2, FUN = function(x) sort(x, decreasing = TRUE)[1:floor(length(x)/2)])

Demonstration:

set.seed(47)
x = matrix(rnorm(100), 10)
x[1, 3] = NA
x
#              [,1]        [,2]        [,3]          [,4]        [,5]       [,6]        [,7]       [,8]
#  [1,]  1.99469634 -0.92245624          NA  0.4836041107  0.06116275  0.9697466  0.03838225  1.2174872
#  [2,]  0.71114251  0.03960243  0.24914817  0.1443376363 -0.10856462  1.6756248  0.06893424  0.7314502
#  [3,]  0.18540528  0.49382018 -0.34041599 -1.2004406274 -0.15469524  1.9882438  1.74017016  1.1339939
#  [4,] -0.28176501 -1.82822917  0.41719084  0.8852306473  0.95048417 -0.9870583  1.30627664  2.1879180
#  [5,]  0.10877555  0.09147291 -0.32646679  0.8869350447 -0.48769640 -1.8300307 -0.14493417  0.2212036
#  [6,] -1.08573747  0.67077922 -0.89029402  0.0006863592 -0.92024188  1.0081416  1.56234731 -0.9390224
#  [7,] -0.98548216 -0.08107805 -1.60815993 -0.6932373819  0.89797526 -0.8691044  1.24215371  0.8384429
#  [8,]  0.01513086  1.26424109 -2.32237229  0.2608364805 -0.35629514 -0.5151981  1.46129302  0.5291967
#  [9,] -0.25204590 -0.70338819 -1.96721918  0.5066869590  1.03190009 -0.5002165 -0.98583638 -1.0883085
# [10,] -1.46575030 -0.04057817  0.02752681  0.5643018376  0.66430042 -0.2725779  0.92561447 -0.7955874
#              [,9]        [,10]
#  [1,]  0.96832400  1.136878023
#  [2,]  0.18510415  0.004507257
#  [3,] -0.41257000  1.341705472
#  [4,] -0.83292772 -1.365424404
#  [5,]  0.95488318  0.926037646
#  [6,] -2.03609798 -0.497367640
#  [7,]  0.07445361 -0.860184103
#  [8,] -0.91453141 -0.060824754
#  [9,]  0.15602420  1.410276163
# [10,]  0.02934662  0.003944793

apply(x, 2, FUN = function(x) sort(x, decreasing = TRUE)[1:floor(length(x)/2)])
#            [,1]       [,2]        [,3]      [,4]       [,5]       [,6]     [,7]      [,8]       [,9]
# [1,] 1.99469634 1.26424109  0.41719084 0.8869350 1.03190009  1.9882438 1.740170 2.1879180 0.96832400
# [2,] 0.71114251 0.67077922  0.24914817 0.8852306 0.95048417  1.6756248 1.562347 1.2174872 0.95488318
# [3,] 0.18540528 0.49382018  0.02752681 0.5643018 0.89797526  1.0081416 1.461293 1.1339939 0.18510415
# [4,] 0.10877555 0.09147291 -0.32646679 0.5066870 0.66430042  0.9697466 1.306277 0.8384429 0.15602420
# [5,] 0.01513086 0.03960243 -0.34041599 0.4836041 0.06116275 -0.2725779 1.242154 0.7314502 0.07445361
#            [,10]
# [1,] 1.410276163
# [2,] 1.341705472
# [3,] 1.136878023
# [4,] 0.926037646
# [5,] 0.004507257

Edit: To return just half of the non-NA values:

apply(x, 2, FUN = function(x) head(sort(x, decreasing = TRUE), floor(sum(!is.na(x))/2)))

This will return a list where each item is a vector half the length (rounded down) of the number of non-missing values in each original column. If it happens that this length is the same for each column, it will be coerced to a matrix, unless that length is 1 in which case it will be a vector.
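To see the ragged case concretely, here is a minimal sketch, assuming a small made-up matrix `m` whose columns contain different numbers of NA values. The helper name `top_half` is hypothetical and not part of the original answer; it uses `head()` instead of `1:n` indexing so that a column with fewer than two non-missing values yields an empty vector rather than a stray element.

```r
# Made-up example data: columns with 0, 1, and 3 missing values
m <- matrix(c(0.5, 0.1, 0.9, 0.3, 0.7, 0.2,
              0.4,  NA, 0.8, 0.6, 0.2, 0.1,
               NA, 0.3,  NA, 0.9,  NA, 0.5), nrow = 6)

# Hypothetical helper: top half (rounded down) of the non-NA values
top_half <- function(v) {
  s <- sort(v, decreasing = TRUE)  # sort() drops NA by default
  head(s, floor(length(s) / 2))
}

res <- apply(m, 2, top_half)
res           # a list, because the columns yield vectors of different lengths
lengths(res)  # number of values kept per column
```

Because the per-column lengths differ here, `apply` cannot simplify the result to a matrix and returns a list instead.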

I was trying to think of a way to avoid `apply` and looping over rows, but I couldn't beat it for speed - `replace(x, TRUE, x[order(col(x), -x)])[1:floor(nrow(x)/2),]` was the best I could do. — thelatemail, Apr 29 '16 at 00:26
That `order(col(x), -x)` is very clever! I'm surprised it's not faster. Though it looks like `order` does its share of `apply`ing under the hood, so perhaps it's really not so different. — Gregor Thomas, Apr 29 '16 at 05:48
Since my NAs are interspersed throughout each column, the number of positive values is different in every column. Your suggestion cuts the whole data in half and leaves NA in the columns that have the fewest positive values. My desired result is to have only half of the positive numbers from each column, without any NA. I think your suggestion is on the right path, but instead of working on each column, it works well only on the data as a whole. — beboo23, Apr 29 '16 at 09:25
I'm going to edit, but I'm tempted to tell you to ask a new question. This comment includes a lot of information that should have been included in your question and demonstrated in a reproducible example from the start. Returning a vector of a different length for each column is quite different from returning rectangular data. I'd strongly encourage you to read up on [making good reproducible examples](http://stackoverflow.com/q/5963269/903061) and to reproducibly share both input and desired output before you ask another question. — Gregor Thomas, Apr 29 '16 at 14:52

Heymans · Answer 2 · 2016-04-28T23:05:50.440


Look at using the head() function.

b <- data.frame(1:4, 5:8)
head(b, n = nrow(b) / 2)

This won't remove your NAs, though, so you can do

head(b[!is.na(b[,1]),1], n = nrow(b)/2)

Then iterate, or use an apply function, changing the 1s in `b[!is.na(b[,1]),1]` to each of your column indices. The result will be a ragged array, since your NAs are interspersed throughout each column.

EDIT: Seeing your comment, you should use order, i.e.:

apply(b, 2, function(x) head(x[order(x, decreasing = TRUE)], n = length(x)/2))
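Note that `length(x)/2` counts the NA entries too, so a column with many NAs can still carry NA into its result. Here is a minimal sketch of one way to adapt the `order`-based approach, assuming the goal is half of the non-missing values in each column. It relies on the `na.last = NA` argument of `order()`, which drops NA positions. The matrix `b2` and the name `top_by_order` are made up for illustration, and `lapply` over the column indices is used so that the result is always a list.

```r
# Made-up example data (not from the question)
b2 <- cbind(c(3, NA, 1, 4, NA, 2),
            c(NA, NA, 5, NA, 9, 7))

top_by_order <- lapply(seq_len(ncol(b2)), function(j) {
  x <- b2[, j]
  # Indices of the non-NA values, largest first; na.last = NA removes NAs
  idx <- order(x, decreasing = TRUE, na.last = NA)
  head(x[idx], n = floor(length(idx) / 2))
})
top_by_order
```

Returning a list keeps the ragged result explicit, rather than depending on whether `apply` happens to simplify it to a matrix.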

edited Apr 28 '16 at 23:05

answered Apr 28 '16 at 22:27


Applying your formula, I got this error message: `Error: length(n) == 1L is not TRUE` – beboo23 Apr 28 '16 at 22:56
Replace nrow with length, and make sure to include the comma I missed: `apply(b, 2, function(x) head(x[order(x, decreasing = TRUE)], n = length(x)/2))`. Sorry, double update: you also need to subset the matrix using the order before calling head. – Heymans Apr 28 '16 at 23:03
