7

I have to remove columns in my dataframe which has over 4000 columns and 180 rows.The conditions I want to set in to remove the column in the dataframe are: (i) Remove the column if there are less then two values/entries in that column (ii) Remove the column if there are no two consecutive(one after the other) values in the column. (iii) Remove the column having all values as NA. I have provided with conditions on which a column is to be deleted. The aim here is not just to find a column by its name like in "How do you delete a column in data.table?". I Illustrate as follows:

A       B    C   D  E
0.018  NA    NA  NA NA
0.017  NA    NA  NA NA
0.019  NA    NA  NA NA
0.018  0.034 NA  NA NA
0.018  NA    NA  NA NA
0.015  NA    NA  NA 0.037
0.016  NA    NA  NA 0.031
0.019  NA    0.4 NA 0.025
0.016  0.03  NA  NA 0.035
0.018  NA    NA  NA 0.035
0.017  NA    NA  NA 0.043
0.023  NA    NA  NA 0.040
0.022  NA    NA  NA 0.042

Desired dataframe:

A       E
0.018   NA
0.017   NA
0.019   NA
0.018   NA
0.018   NA
0.015   0.037
0.016   0.031
0.019   0.025
0.016   0.035
0.018   0.035
0.017   0.043
0.023   0.040
0.022   0.042

How can I incoporate these three conditions in one code. I would appreciate your help in this regard. Reproducible example

structure(list(Month = c("Jan-2000", "Feb-2000", "Mar-2000", 
"Apr-2000", "May-2000", "Jun-2000"), A.G.L.SJ.INVS...LON..DEAD...13.08.15 = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), ABACUS.GROUP.DEAD...18.02.09 = c(0.00829384766220866, 
0.00332213653674028, 0, 0, NA, NA), ABB.R..IRS. = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), .Names = c("Month", 
"A.G.L.SJ.INVS...LON..DEAD...13.08.15", "ABACUS.GROUP.DEAD...18.02.09", 
"ABB.R..IRS."), class = c("data.table", "data.frame"), row.names = c(NA, 
-6L), .internal.selfref = <pointer: 0x0000000001c90788>)
Aquarius
  • 262
  • 1
  • 6
  • 20
  • 2
    For (i): `df[,sapply(df, function(x) sum(!is.na(x))>1)]` – jogo Jan 20 '16 at 14:46
  • @Frank the removal of columns here is based on conditions not the name. – Aquarius Jan 22 '16 at 14:49
  • @Aquarius In the linked q, those are also conditions. They are just conditions based on names rather than the vector of values. I do not see it as essentially different. I also think the target link is helpful because it shows `:= NULL`, which is how to remove columns by reference (instead of making an entirely new table). – Frank Jan 22 '16 at 14:51
  • @Frank well I think its bit harsh decision here. – Aquarius Jan 22 '16 at 14:52
  • You shouldn't see it that way. There's nothing wrong with asking a question that gets marked as a dupe. Dupes and their answers also help the site (since people can find your question in google results too). Marking as a dupe just makes it easier to find related questions, I think. Here's a blog post by the site's cofounder about it: http://blog.stackoverflow.com/2010/11/dr-strangedupe-or-how-i-learned-to-stop-worrying-and-love-duplication/ – Frank Jan 22 '16 at 14:53
  • @Frank alright if it still helps others and site. I'm ok with it but reconsideration wouldn't have been bad. – Aquarius Jan 22 '16 at 14:56
  • Yeah, I mentioned it to David (the answerer below, who will be back Monday) and he can easily reverse it if he thinks differently than me, as might others before then. – Frank Jan 22 '16 at 14:57
  • @Frank I'm having a bad day at stackoverflow today. – Aquarius Jan 22 '16 at 14:59
  • @Frank This isn't a dupe. Not even close. The idea here was to computationally create conditions, not just remove column by *any* condition. OP couldn't' figure out fro the dupe you linked he will need to do `colSums`, `diff`, `all`, or anything else. The fact the column is being removed eventually is just last part of this logical process. – David Arenburg Jan 23 '16 at 22:58

2 Answers2

8

I feel like this is all over-complicated. Condition 2 already includes all the rest of the conditions, as if there are at least two non-NA values in a column, obviously the whole column aren't NAs. And if there are at least two consecutive values in a column, then obviously this column contains more than one value. So instead of 3 conditions, this all sums up into a single condition (I prefer not to run many functions per column, rather after running diff per column- vecotrize the whole thing):

cond <- colSums(is.na(sapply(df, diff))) < nrow(df) - 1

This works because if there are no consecutive values in a column, the whole column will become NAs.

Then, just

df[, cond, drop = FALSE]
#        A     E
# 1  0.018    NA
# 2  0.017    NA
# 3  0.019    NA
# 4  0.018    NA
# 5  0.018    NA
# 6  0.015 0.037
# 7  0.016 0.031
# 8  0.019 0.025
# 9  0.016 0.035
# 10 0.018 0.035
# 11 0.017 0.043
# 12 0.023 0.040
# 13 0.022 0.042

Per your edit, it seems like you have a data.table object and you also have a Date column so the code would need some modifications.

cond <- df[, lapply(.SD, function(x) sum(is.na(diff(x)))) < .N - 1, .SDcols = -1] 
df[, c(TRUE, cond), with = FALSE]

Some explanations:

  • We want to ignore the first column in our calculations so we specify .SDcols = -1 when operating on our .SD (which means Sub Data in data.tableis)
  • .N is just the rows count (similar to nrow(df)
  • Next step is to subset by condition. We need not forget to grab the first column too so we start with c(TRUE,...
  • Finally, data.table works with non standard evaluation by default, hence, if you want to select column as if you would in a data.frame you will need to specify with = FALSE

A better way though, would be just to remove the column by reference using := NULL

cond <- c(FALSE, df[, lapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1, .SDcols = -1])
df[, which(cond) := NULL]
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/101301/discussion-between-aquarius-and-david-arenburg). – Aquarius Jan 21 '16 at 13:03
6

Create logical vectors for each condition:

# condition 1
cond1 <- sapply(df, function(col) sum(!is.na(col)) < 2)

# condition 2
cond2 <- sapply(df, function(col) !any(diff(which(!is.na(col))) == 1))

# condition 3
cond3 <- sapply(df, function(col) all(is.na(col)))

Then combine them into one mask:

mask <- !(cond1 | cond2 | cond3)

> df[,mask,drop=F]
       A     E
1  0.018    NA
2  0.017    NA
3  0.019    NA
4  0.018    NA
5  0.018    NA
6  0.015 0.037
7  0.016 0.031
8  0.019 0.025
9  0.016 0.035
10 0.018 0.035
11 0.017 0.043
12 0.023 0.040
13 0.022 0.042
Zelazny7
  • 39,946
  • 18
  • 70
  • 84
  • FYI, you only need condition 2 here which could be simplified to `cond2 <- sapply(df, function(col) any(!is.na(diff(col))))` and you good to go. All the rest of the conditions are redundant. – David Arenburg Jan 20 '16 at 16:30
  • 2
    The OP seems new to R so I opted for a clearer, if redundant, answer that hopefully demonstrates a pattern that can be useful in the future. And while the conditions can be combined in *this* example, that may not always be the case. – Zelazny7 Jan 20 '16 at 16:54
  • I take all the blame for providing inefficient information to execute this task.I'm new here and inexperienced with the software. My apologies but Answer 2 fits best to what I had asked.I hope you will understand. – Aquarius Jan 21 '16 at 13:00