0

I have a data set like:

  ID               EMAIL_ID.x        ID.y     EMAIL_ID.y
 60842               k@aol.com      60842        k@aol.com
 90853                a.b.c         NA            <NA>
 90854                b.c.d         NA            <NA>
 83907               h@gwi.net      02854         <NA>
 83908             s@hotmail.com    00952         <NA>
 83909             l@rediff.com     78895         <NA>

In this data, i only want to retain the rows in which both the 3rd and the 4th column are either populated or both not populated.

Hence the result should be like:

   ID               EMAIL_ID.x       ID.y     EMAIL_ID.y
 60842               k@aol.com      60842        k@aol.com
 90853                a.b.c         NA            <NA>
 90854                b.c.d         NA            <NA>

In the above, the rows where column 3 had values but column 4 did not, have been deleted.

Please suggest how can this be done in R.

Thankyou!

eclairs
  • 1,515
  • 6
  • 21
  • 26
  • 3
    `df[rowSums(is.na(df[,3:4]))!=1,]` – Jaap Sep 14 '16 at 13:05
  • assuming that all NAs are actual NAs and not strings as they are currently presented in the example – Sotos Sep 14 '16 at 13:07
  • Trying the above gave me result like: – eclairs Sep 14 '16 at 13:13
  • NA NA for all rows – eclairs Sep 14 '16 at 13:15
  • 1
    You could try to use the `na.strings` parameter when reading the file. For example: `na.strings = c('NA','')`. Otherwise, please include the output of `dput(head(name_of_your_df)` in your question. – Jaap Sep 14 '16 at 13:18
  • @ProcrastinatusMaximus Sorry I think the dupe is not correct as the question is asking for a different logic `either populated or both not populated`. – akrun Sep 14 '16 at 13:30
  • @ProcrastinatusMaximus i.e. `either populated or both not populated` – akrun Sep 14 '16 at 13:32
  • 4
    which is still a rowwise NA count imo – Jaap Sep 14 '16 at 13:33
  • 4
    @akrun well the answer is certainly the same or very very similar, which makes it a dupe – Cath Sep 14 '16 at 13:33
  • 1
    @ProcrastinatusMaximus The logic in that dupe is `d[rowSums(is.na(d[1:3]))<2,]` ie. less than 2 NAs, but here the logic is different – akrun Sep 14 '16 at 13:34
  • 4
    @akrun no, the numbers are different but the logic (row-wise counting of NAs) is exacly the same. Just a variation of the same thing. – Cath Sep 14 '16 at 13:35
  • @ProcrastinatusMaximus Are you saying that all logic that use `rowSums` is a dupe. – akrun Sep 14 '16 at 13:36
  • 2
    @akrun no, see also Cath's comment – Jaap Sep 14 '16 at 13:38
  • @ProcrastinatusMaximus All I am saying is the one that was closed [recently](http://stackoverflow.com/questions/39490749/introducing-na-to-as-numeric) by RHertel is an exact dupe and not this – akrun Sep 14 '16 at 13:41

1 Answers1

1

One option is to check whether there are NA elements (is.na(...)) in both (&) columns or (|) both non-NA elements (by negating !) and use that logical index to subset the rows.

df1[is.na(df1[,3]) & is.na(df1[,4]) | !is.na(df1[,3]) & !is.na(df1[,4]),]
#    ID EMAIL_ID.x JPNUMBER.y EMAIL_ID.y
#1 60842  k@aol.com      60842  k@aol.com
#2 90853      a.b.c         NA       <NA>
#3 90854      b.c.d         NA       <NA>

Or another option is to apply rowSums on the logical matrix (is.na(df1[3:4])), check if the sum of NA elements are either 0 or 2 using %in% (if it is 0 - it implies there are no NA elements and 2 means both the columns have NA) and use the logical vector to subset the rows.

df1[rowSums(is.na(df1[3:4])) %in% c(0,2),]
#    ID EMAIL_ID.x JPNUMBER.y EMAIL_ID.y
#1 60842  k@aol.com      60842  k@aol.com
#2 90853      a.b.c         NA       <NA>
#3 90854      b.c.d         NA       <NA>

Or another option is Reduce with lapply (to avoid conversion to matrix - in case the dataset is really big)

df1[Reduce(`+`, lapply(df1[3:4], is.na)) != 1,]

NOTE: If the NA in the OP's dataset is not real NA, we need to convert it to real NA before doing this (assuming the 4th column is character class)

is.na(df1[,4]) <- df1[,4] == "<NA>"

data

df1 <- structure(list(ID = c(60842L, 90853L, 90854L, 83907L, 83908L, 
83909L), EMAIL_ID.x = c("k@aol.com", "a.b.c", "b.c.d", "h@gwi.net", 
"s@hotmail.com", "l@rediff.com"), JPNUMBER.y = c(60842L, NA, 
NA, 2854L, 952L, 78895L), EMAIL_ID.y = c("k@aol.com", NA, NA, 
NA, NA, NA)), .Names = c("ID", "EMAIL_ID.x", "JPNUMBER.y", "EMAIL_ID.y"
), row.names = c(NA, -6L), class = "data.frame")
akrun
  • 874,273
  • 37
  • 540
  • 662