deleting rows in R containing one blank column

Question

I have a data set like:

  ID               EMAIL_ID.x        ID.y     EMAIL_ID.y
 60842               k@aol.com      60842        k@aol.com
 90853                a.b.c         NA            <NA>
 90854                b.c.d         NA            <NA>
 83907               h@gwi.net      02854         <NA>
 83908             s@hotmail.com    00952         <NA>
 83909             l@rediff.com     78895         <NA>

In this data, i only want to retain the rows in which both the 3rd and the 4th column are either populated or both not populated.

Hence the result should be like:

   ID               EMAIL_ID.x       ID.y     EMAIL_ID.y
 60842               k@aol.com      60842        k@aol.com
 90853                a.b.c         NA            <NA>
 90854                b.c.d         NA            <NA>

In the above, the rows where column 3 had values but column 4 did not, have been deleted.

Please suggest how can this be done in R.

Thankyou!

assuming that all NAs are actual NAs and not strings as they are currently presented in the example — Sotos, Sep 14 '16 at 13:07
You could try to use the `na.strings` parameter when reading the file. For example: `na.strings = c('NA','')`. Otherwise, please include the output of `dput(head(name_of_your_df)` in your question. — Jaap, Sep 14 '16 at 13:18
@ProcrastinatusMaximus Sorry I think the dupe is not correct as the question is asking for a different logic `either populated or both not populated`. — akrun, Sep 14 '16 at 13:30
@ProcrastinatusMaximus i.e. `either populated or both not populated` — akrun, Sep 14 '16 at 13:32
@akrun well the answer is certainly the same or very very similar, which makes it a dupe — Cath, Sep 14 '16 at 13:33
@ProcrastinatusMaximus The logic in that dupe is `d[rowSums(is.na(d[1:3]))<2,]` ie. less than 2 NAs, but here the logic is different — akrun, Sep 14 '16 at 13:34
@akrun no, the numbers are different but the logic (row-wise counting of NAs) is exacly the same. Just a variation of the same thing. — Cath, Sep 14 '16 at 13:35
@ProcrastinatusMaximus Are you saying that all logic that use `rowSums` is a dupe. — akrun, Sep 14 '16 at 13:36
@ProcrastinatusMaximus All I am saying is the one that was closed [recently](http://stackoverflow.com/questions/39490749/introducing-na-to-as-numeric) by RHertel is an exact dupe and not this — akrun, Sep 14 '16 at 13:41

akrun · Accepted Answer · 2016-09-14T13:17:15.527

One option is to check whether there are NA elements (is.na(...)) in both (&) columns or (|) both non-NA elements (by negating !) and use that logical index to subset the rows.

df1[is.na(df1[,3]) & is.na(df1[,4]) | !is.na(df1[,3]) & !is.na(df1[,4]),]
#    ID EMAIL_ID.x JPNUMBER.y EMAIL_ID.y
#1 60842  k@aol.com      60842  k@aol.com
#2 90853      a.b.c         NA       <NA>
#3 90854      b.c.d         NA       <NA>

Or another option is to apply rowSums on the logical matrix (is.na(df1[3:4])), check if the sum of NA elements are either 0 or 2 using %in% (if it is 0 - it implies there are no NA elements and 2 means both the columns have NA) and use the logical vector to subset the rows.

df1[rowSums(is.na(df1[3:4])) %in% c(0,2),]
#    ID EMAIL_ID.x JPNUMBER.y EMAIL_ID.y
#1 60842  k@aol.com      60842  k@aol.com
#2 90853      a.b.c         NA       <NA>
#3 90854      b.c.d         NA       <NA>

Or another option is Reduce with lapply (to avoid conversion to matrix - in case the dataset is really big)

df1[Reduce(`+`, lapply(df1[3:4], is.na)) != 1,]

NOTE: If the NA in the OP's dataset is not real NA, we need to convert it to real NA before doing this (assuming the 4th column is character class)

is.na(df1[,4]) <- df1[,4] == "<NA>"

data

df1 <- structure(list(ID = c(60842L, 90853L, 90854L, 83907L, 83908L, 
83909L), EMAIL_ID.x = c("k@aol.com", "a.b.c", "b.c.d", "h@gwi.net", 
"s@hotmail.com", "l@rediff.com"), JPNUMBER.y = c(60842L, NA, 
NA, 2854L, 952L, 78895L), EMAIL_ID.y = c("k@aol.com", NA, NA, 
NA, NA, NA)), .Names = c("ID", "EMAIL_ID.x", "JPNUMBER.y", "EMAIL_ID.y"
), row.names = c(NA, -6L), class = "data.frame")

deleting rows in R containing one blank column

1 Answers1

data