Finding columns in dataframe with specific number of values in R

Question

I have a dataframe(data.table) with over 3000 columns. I need to find out the columns in my dataframe that have only 2 and less than 2 values in them. I then after extracting those columns with 2 and less then 2 values I want to drop them out of the original data frame. I illustrate as follows: Original data frame

Month   A       B         C
Jan-00  0.007   NA       1758.27
Feb-00  0.004   NA       1310.43
Mar-00  0.004   NA       1260.89
Apr-00  0.004   0.0002   1137.34
May-00  0.005   6.05E-05 1595.78
Jun-00  0.003   NA       4968.89
Jul-00  0.007   NA       NA
Aug-00  0.005   NA       NA
Sep-00  0.004   NA       NA

Desired output

     Month    A         C
    Jan-00  0.007   1758.27
    Feb-00  0.004   1310.435
    Mar-00  0.004   1260.89
    Apr-00  0.004   1137.342105
    May-00  0.005   1595.78125
    Jun-00  0.003   4968.895238
    Jul-00  0.007   NA
    Aug-00  0.005   NA
    Sep-00  0.004   NA

I would appreciate your help in this regard.

Did you make any attempts on your own after asking a very similar question recently https://stackoverflow.com/questions/34902809 ? Did you get stuck somewhere in the process? — talat, Jan 22 '16 at 09:38
@docendodiscimus thank you for your valuable information. But that question is based on multiple conditions while mine is not. I'm new to the software. — Aquarius, Jan 22 '16 at 09:46
@mrub extracting example from a dataframe of over 3000 columns and 180 rows, which has the scenerio you want to present can be bit difficult at time. — Aquarius, Jan 22 '16 at 09:48
@Aquarius, well to me it looks like this question is exactly the same as your condition (i) in your previous question. And if you look at [this answer](http://stackoverflow.com/a/34903124/3521006), you can find all conditions adressed separately. — talat, Jan 22 '16 at 09:49
@Aquarius Then either use e.g. `dput(db[1:5,1:5])` or provide the dummy data in a format that can easily be imported by others. — mrub, Jan 22 '16 at 09:50

akrun · Accepted Answer · 2016-01-22T09:40:11.067

4

We can use Filter

Filter(function(x) sum(!is.na(x))>2, df1)
#   Month     A       C
#1 Jan-00 0.007 1758.27
#2 Feb-00 0.004 1310.43
#3 Mar-00 0.004 1260.89
#4 Apr-00 0.004 1137.34
#5 May-00 0.005 1595.78
#6 Jun-00 0.003 4968.89
#7 Jul-00 0.007      NA
#8 Aug-00 0.005      NA
#9 Sep-00 0.004      NA

Or

df1[colSums(!is.na(df1))>2]

If the dataset is data.table,

 library(data.table)
 setDT(df1)[,unlist(df1[, lapply(.SD, function(x) 
                        sum(!is.na(x))>2)]), with=FALSE]

Or

 setDT(df1)[, Filter(function(x) sum(!is.na(x))>2, .SD)]

edited Jan 22 '16 at 09:40

answered Jan 22 '16 at 09:32

akrun

874,273
37
540
662

Thank you for the answer and apologies for the duplication. Could you please tell me I'm unable to subet it this for example `df<-df1[,1:100]`. It only gives me a vector of 1:100 no rows. – Aquarius Jan 22 '16 at 10:21
1

@Aquarius If it is a `data.table`, then use `with=FALSE`, i.e. `df1[, 1:100, with=FALSE]` – akrun Jan 22 '16 at 10:23
Thank you it works. Could you please tell me what is the purpose of this `with=FALSE` part of code. – Aquarius Jan 22 '16 at 10:25
1

@Aquarius It is mentioned in `?data.table` i.e. `A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (when with=FALSE) a vector of names or positions to select.` – akrun Jan 22 '16 at 10:28
Could you please help me if I want to delete a column with all values as NA and Zero. – Aquarius Jan 22 '16 at 15:39
We can do `df1[!(colSums(is.na(df1))==nrow(df1)| colSums(df1, na.rm=TRUE)==0)]` I am assuming that all the columns are numeric. – akrun Jan 22 '16 at 15:41
My first column is date column – Aquarius Jan 22 '16 at 15:48
@Aquarius So, that column should be removed from the `colSums` as it is 'Date' class i..e. `dfN <- df1[-1]` and then do the `colSums` on `dfN`, subset the `dfN` columns and cbind with the first column. – akrun Jan 22 '16 at 15:49
Its not working the dimension of my original dataframe are 180 rows and 3590 columns (including date column). After removing the data column the resultant dataframe has 3584 rows. Is data.table causing problems – Aquarius Jan 22 '16 at 15:57
@Aquarius Okay, my solution was specific for `data.frame`. AS I mentioned earlier for subsetting `data.table`, we need to use `with=FALSE` – akrun Jan 22 '16 at 15:58
Honestly I used `with=FALSE` – Aquarius Jan 22 '16 at 16:00
@Aquarius Let me test it – akrun Jan 22 '16 at 16:01
1

Can you try `df2[,df2[, unlist(lapply(.SD, function(x) if(is.numeric(x)) sum(x, na.rm=TRUE)>0 else sum(!is.na(x))>0))],with=FALSE]` (`data.table` solution) – akrun Jan 22 '16 at 16:06

Finding columns in dataframe with specific number of values in R

1 Answers1