0

I have a dataframe with 5000 columns and 1000 rows but a lot of data are missing. I'm interested only on columns that have at least 10 data. How to remove columns with less than 10 values? Thank you!

I tried:

miss <- c()
for(i in 1:ncol(test)) {
   if(length(test[,i]) < 10)
  miss <- append(miss,i)
 }
data2 <- test[,-miss]

Error in -miss : invalid argument to unary operator

SmokeyShakers
  • 3,372
  • 1
  • 7
  • 18

3 Answers3

4

Take this example data:

df <- data.frame(v1 = c(1:5, NA),
                 v2 = c(1:5, NA),
                 V3 = rep(NA, 6),
                 v4 = c(1:5, NA))
df
  v1 v2 V3 v4
1  1  1 NA  1
2  2  2 NA  2
3  3  3 NA  3
4  4  4 NA  4
5  5  5 NA  5
6 NA NA NA NA

You can use is.na() inside colSums() to count the number of missing values in each column.

colSums(is.na(df))
v1 v2 V3 v4 
 1  1  6  1 

Then keep variables according to a given threshold of missing values as follows:

df[, colSums(is.na(df)) < 5]
  v1 v2 v4
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4
5  5  5  5
6 NA NA NA
L--
  • 565
  • 1
  • 12
0

Here is a tidyversealternative:

library(purrr)
library(dplyr)

df %>% 
  map_df(~sum(is.na(.))) %>% 
  stack() %>%  
  filter(values < 5)

outcome using @L-- data:

  values ind
1      1  v1
2      1  v2
3      1  v4
TarJae
  • 72,363
  • 6
  • 19
  • 66
0

This is pretty simple to do using dplyr (using sample data provided by L--):

library(dplyr)

df %>% 
  select(where(~ sum(!is.na(.)) >= 5)) # change 5 to 10 in your code

Output

  v1 v2 v4
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4
5  5  5  5
6 NA NA NA
LMc
  • 12,577
  • 3
  • 31
  • 43