Remove columns that have less than 10 data

Question

I have a dataframe with 5000 columns and 1000 rows but a lot of data are missing. I'm interested only on columns that have at least 10 data. How to remove columns with less than 10 values? Thank you!

I tried:

miss <- c()
for(i in 1:ncol(test)) {
   if(length(test[,i]) < 10)
  miss <- append(miss,i)
 }
data2 <- test[,-miss]

Error in -miss : invalid argument to unary operator

score 4 · Answer 1 · answered Mar 20 '23 at 20:23

Take this example data:

df <- data.frame(v1 = c(1:5, NA),
                 v2 = c(1:5, NA),
                 V3 = rep(NA, 6),
                 v4 = c(1:5, NA))
df

  v1 v2 V3 v4
1  1  1 NA  1
2  2  2 NA  2
3  3  3 NA  3
4  4  4 NA  4
5  5  5 NA  5
6 NA NA NA NA

You can use is.na() inside colSums() to count the number of missing values in each column.

colSums(is.na(df))

v1 v2 V3 v4 
 1  1  6  1

Then keep variables according to a given threshold of missing values as follows:

df[, colSums(is.na(df)) < 5]

Thanks! In my case df[, colSums(is.na(df)) < 990] works well — Corinne, Mar 21 '23 at 08:34

score 0 · Answer 2 · answered Mar 20 '23 at 20:36

0

Here is a tidyversealternative:

library(purrr)
library(dplyr)

df %>% 
  map_df(~sum(is.na(.))) %>% 
  stack() %>%  
  filter(values < 5)

outcome using @L-- data:

  values ind
1      1  v1
2      1  v2
3      1  v4

answered Mar 20 '23 at 20:36

TarJae

72,363
6
19
66

score 0 · Answer 3 · answered Mar 20 '23 at 20:50

This is pretty simple to do using dplyr (using sample data provided by L--):

library(dplyr)

df %>% 
  select(where(~ sum(!is.na(.)) >= 5)) # change 5 to 10 in your code

Output

Remove columns that have less than 10 data

3 Answers3