12

I have a data.frame with 15,000 observations of 34 ordinal and NA variables. I am performing clustering for a market segmentation study and need the rows with only NAs removed. After taking out the userID I got an error message saying to omit 2099 rows with only NAs before clustering.

I found a link for removing rows with all NA values, but I need to identify which of the 2099 rows have all NA values. Here the link for the discussion removing rows with all NA values: Remove Rows with NAs in data.frame

Here's a sample of the first five observations from six variables:

> head(Store2df, n=5)
  RowNo      Age Gender HouseholdIncome MaritalStatus PresenceofChildren
1     1     <NA>   Male            <NA>          <NA>               <NA>
2     2    45-54 Female            <NA>          <NA>               <NA>
3     3     <NA>   <NA>            <NA>          <NA>               <NA>
4     4     <NA>   <NA>            <NA>          <NA>               <NA>
5     5    45-54 Female        75k-100k       Married                Yes
#Making a vector
> Vector1 <- Store2df$RowNo 
#Taking out RowNo column
> Store2df$RowNo <- NULL

EDIT: I put the results in a object, but found that the code made an extra column. Clicking in RStudio's environment, an extra column called row.names was created labeling each row with the original row name. A couple thousand rows were deleted and the new column labeled the new rows with the old row number. However when looking at the head of the new object, I did not see the row label. Why does the row.names label show in the environment, but not when I view the head?

#Remove all rows with only NA values
> Store2df <- Store2[!!rowSums(!is.na(Store2)),]
#View head of store2df
> head(Store2df)
    Age Gender HouseholdIncome MaritalStatus PresenceofChildren
1  <NA>   Male            <NA>          <NA>               <NA>
2 45-54 Female            <NA>          <NA>               <NA>
5 45-54 Female        75k-100k       Married                Yes
6 25-34   Male        75k-100k       Married                 No
7 35-44 Female       125k-150k       Married                Yes
8 55-64   Male        75k-100k       Married                 No

EDIT 2: I put in the row number/userID column to keep track of the number of users. To perform the operation for removing all NAs, I took out the first column. Now I need to keep track of the users I removed. I have a list of over 2000 rows that had all NA values, I don't want to create an index manually putting in each row.

Question: How do I remove the emails that the missing data corresponded to?

> #First six rows of the column RowNo
> head(Store2df$RowNo)
[1] 1 2 3 4 5 6

I want 2099 rows deleted in the Store2df data.frame with the RowNo included. Here's the script identifying which rows are all empty in the Store2df data.frame without RowNo.

> which(rowSums(is.na(Store2df))==ncol(Store2df))

Showing the first 6 rows, row number 3 and 4 are deleted.

> head(Store2df$RowNo)
[1] 1 2 5 6 7 8

There are 4 steps I want to complete:

1) Take out RowNo column in Store2df data.frame and save as separate vector

2) Delete rows with all NA values in Store2df data.frame

3) Delete same rows in Store2new1 vector as Store2df data.frame

4) Combine vector and data.frame with vector matching the data.frame

Scott Davis
  • 983
  • 6
  • 22
  • 43
  • That link refers to dropping columns. Did you mean [this link](http://stackoverflow.com/questions/4862178/remove-rows-with-nas-in-data-frame)? And in what way does it not solve your problem? – David Robinson Sep 01 '14 at 05:10
  • @DavidRobinson I made the mistake of posting the wrong link, thank you. I meant to post that one but had too many tabs open! That link drops the rows for NA they know of. I do not want to go through all 15000 rows to identify which 2099 have all NA values. – Scott Davis Sep 01 '14 at 05:13
  • What do you mean "they know of"? The link doesn't require knowing which in advance. Having said that, the answers there don't quite answer your question since they remove rows with any NAs, not just with *all* NAs – David Robinson Sep 01 '14 at 05:23
  • @DavidRobinson I misread their subset on the complete.cases() function. They were selecting columns, not rows. I am trying to find an efficient way of taking out the rows I know to be all NAs. Subsetting this data.frame for all the filled values would take too much time. – Scott Davis Sep 01 '14 at 16:54
  • @Scott David I checked in RStudio Environment. I also noticed an extra column `row.names` is created, but in the Global Environment, it says: `Store2df 3 obs. of 7 variables`. – akrun Sep 09 '14 at 18:10
  • @Scott Davis I couldn't figure out why it creates the row.names column. Tried some modifications, but still in RStudio global environment had that row.names. – akrun Sep 09 '14 at 18:22
  • @akrun Ok. As long as it doesn't show in the head, I bet it's not a variable. Thank you again for following up! – Scott Davis Sep 09 '14 at 19:30
  • @Scott Davis Your description is confusing. `head(Store2df$RowNo) [1] 1 2 3 4 5 6 I want the same rows deleted in the Store2df data.frame deleted in the index Store2df$RowNo > head(Store2df$RowNo) [1] 1 2 5 6 7 8` – akrun Sep 20 '14 at 19:34
  • @Scott Davis I did Update2. Please check. Your description is confusing as you mentioned about a lot of things i.e Rstudio keeping row.names etc.. – akrun Sep 20 '14 at 19:51
  • @akrun I edited the post. Please let me know if I can clear anything up. I want to reduce the size of the vector `RowNo` in order to merge with `data.frame` Store2df. `RowNo` will refer to the correct information in Store2df. – Scott Davis Sep 20 '14 at 20:12
  • @Scott Davis Could you post this as a new question as this page has already became crowded with lot of comments. Also, when you post, make sure that you use `dput` ie. `dput(head(Store2df, 6))`. In the update, I showed you one way to remove the `RowNo`. But, it seems like you are not following it. In your recent update, you mentioned 4 steps to complete. I guess you already did a couple of steps completed. This is all confusing. – akrun Sep 20 '14 at 20:23
  • @Scott Davis The 3rd and 4th steps you mentioned are confusing. `3) Delete same rows in Store2df$RowNo as Store2df data.frame 4) Combine vector and data.frame with vector matching the data.frame` If `Store2df` is a single object and the rows are already deleted, I don't understand how you would delete it again. – akrun Sep 20 '14 at 20:34
  • @akrun sorry for the confusion. I made an edit changing the vector to a separate object. I can't remove the `RowNo`s I want because the `RowNo` column has no `NA`s. The command you posted only removes rows with just `Na`s. If you want I can repost this in another question. – Scott Davis Sep 20 '14 at 20:49
  • @Scott Davis After reading it again, I guess you meant `Store2` instead of `Store2df` for the 1st 2 requests because `Store2df` is the one you got after removing the NAs, right? Also, for the fourth step, if we already removed the corresponding rows in both the vector and dataframe ( here don't know which dataframe you are referring to), why do we need to merge? – akrun Sep 20 '14 at 20:53
  • @akrun yes that's right! – Scott Davis Sep 20 '14 at 20:57
  • @Scott Davis Okay, I updated. – akrun Sep 20 '14 at 21:03

2 Answers2

17
 which(rowSums(is.na(Store2))==ncol(Store2))
 #3 4 
 #3 4 

Or

 which(Reduce(`&`,as.data.frame(is.na(Store2))))
 #[1] 3 4

Or

 which(!rowSums(!is.na(Store2)))  
 #3 4 
 #3 4 

data

 Store2 <- structure(list(Age = c(NA, "45-54", NA, NA, "45-54"), Gender = c("Male", 
 "Female", NA, NA, "Female"), HouseholdIncome = c(NA, NA, NA, 
  NA, "75k-100k"), MaritalStatus = c(NA, NA, NA, NA, "Married"), 
PresenceofChildren = c(NA, NA, NA, NA, "Yes"), HomeOwnerStatus = c(NA, 
NA, NA, NA, "Own"), HomeMarketValue = c(NA, NA, NA, NA, "150k-200k"
)), .Names = c("Age", "Gender", "HouseholdIncome", "MaritalStatus", 
"PresenceofChildren", "HomeOwnerStatus", "HomeMarketValue"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5"))

Update

To drop the rows with all NAs

  Store2[!!rowSums(!is.na(Store2)),]
  #   Age Gender HouseholdIncome MaritalStatus PresenceofChildren HomeOwnerStatus
  #1  <NA>   Male            <NA>          <NA>               <NA>            <NA>
  #2 45-54 Female            <NA>          <NA>               <NA>            <NA>
  #5 45-54 Female        75k-100k       Married                Yes             Own
   #HomeMarketValue
  #1            <NA>
  #2            <NA>
  #5       150k-200k
  • is.na(Store2) gives a logical index of elements that are missing or NA
  • ! will negate the logical index i.e. TRUE becomes FALSE and viceversa
  • rowSums of the above code gives the sum of elements that are not NA in each row

        rowSums(!is.na(Store2))
        #   1 2 3 4 5 
        #   1 2 0 0 7  # 3rd and 4th row have `0 non NA` values
    
  • ! Negate the above gives

        !rowSums(!is.na(Store2))
        # 1     2     3     4     5 
        #FALSE FALSE  TRUE  TRUE FALSE 
    
  • We wanted to drop those rows that are all NA's or 0 non NAs. So ! again

        !!rowSums(!is.na(Store2))
        #1     2     3     4     5 
        #TRUE  TRUE FALSE FALSE  TRUE 
    
  • Subset using the above logical index

Update2

If you have two rowNo, i.e. the one you stored separately before deleting the NA rows and the second after you deleted the NAs.

   RowNo1 <- 1:6
   RowNo2 <- c(1,2,5,6)
   RowNo1 %in% RowNo2
   #[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE
   RowNo1[RowNo1 %in% RowNo2]
   #[1] 1 2 5 6

Update3

With your new requests, let me try it again:

    Store2 <- structure(list(RowNo = 1:5, Age = c(NA, "45-54", NA, NA, "45-54"
    ), Gender = c("Male", "Female", NA, NA, "Female"), HouseholdIncome = c(NA, 
    NA, NA, NA, "75k-100k"), MaritalStatus = c(NA, NA, NA, NA, "Married"
   ), PresenceofChildren = c(NA, NA, NA, NA, "Yes")), .Names = c("RowNo", 
   "Age", "Gender", "HouseholdIncome", "MaritalStatus", "PresenceofChildren"
   ), class = "data.frame", row.names = c("1", "2", "3", "4", "5"
   ))

First step

Saving RowNo as separate vector (I am not sure why you need this)

  Store2new1 <- Store2$RowNo

Second step

Delete rows with all NA values in Store2 data.frame and store it as Store2df

   Store2df <- Store2[!!rowSums(!is.na(Store2[,-1])),] #Here you already get the new dataset with `RowNo` column

   Store2df
   #RowNo   Age Gender HouseholdIncome MaritalStatus PresenceofChildren
   #1     1  <NA>   Male            <NA>          <NA>               <NA>
   #2     2 45-54 Female            <NA>          <NA>               <NA>
   #5     5 45-54 Female        75k-100k       Married                Yes

Third step

Delete same rows in Store2new1 vector as Store2df data.frame

   Store2new2 <- Store2new1[Store2new1 %in% Store2df$RowNo]
   Store2new1[Store2new1 %in% Store2df$RowNo]
   #[1] 1 2 5

Fourth step

I don't really think the fourth step or third is required unless you want to delete more rows, which is not clear from the post.

akrun
  • 874,273
  • 37
  • 540
  • 662
  • I put the results in an object, but found out another column is added showing the original rows. How can we change code so it doesn't insert the extra row? – Scott Davis Sep 09 '14 at 17:32
  • @Scott Davis Could you update the post with the new info, possibly using `dput` output as I showed for `Store2` and also the expected output as I didn't understand clearly from your description. – akrun Sep 09 '14 at 17:36
  • I just made the edit. I am confused why the new column appeared in the environment and not in the table when I called the head() function. – Scott Davis Sep 09 '14 at 17:56
  • @Scott Davis Could you tell me whether `df2` is `Store2` and your expected output? – akrun Sep 09 '14 at 17:59
  • I made a typo. I am still referring to the table. – Scott Davis Sep 09 '14 at 18:01
  • I have another question. Say I had a column stating all the `UserID`s. In order to carry out the operation of removing rows with all `NA`s, I had to remove the user id's first. There are some users with no information at all. How would I remove the `UserID`s after removing the rows with all `NA` values? I don't want to copy and paste over a thousand rows into an index function. – Scott Davis Sep 19 '14 at 22:34
  • @Scott Davis As I understand it, in the `Store2df` dataset, you got rows 1,2, 5,6,7,8 after removing rows with all `NA's`. Now, I guess, you wanted to remove some more rows? For example, rows 1 and 2 have missing values in all columns except 1 and 2 column. It is a little unclear. Could you show the expected result? Or is it that you wanted to have a sequence of row numbers instead of 1,2, 5,6,.... If that is the case `row.names(Store2df) <- 1:nrow(Store2df)`. – akrun Sep 20 '14 at 04:22
  • Yes, I want to remove 2099 rows from a vector called `RowNo`. Your example is correct, there are some `RowNo`s that have no data. The function shown above `which(rowSums(is.na(Store2df))==ncol(Store2df))` shows which rows have no data AFTER taking out the column `RowNo`. I don't want to create an index, copy/paste every number, and add commas in between. Is there a faster way? – Scott Davis Sep 20 '14 at 19:28
  • @Scott Davis If I understand correct `Row.No` is not a column in your dataset `Store2df`. It is the one that got printed in your Rstudio environment. If you just show your expected result from the above dataset, I can try. Descriptions can get confusing. – akrun Sep 20 '14 at 19:38
  • I see sorry for the confusion! `RowNo` was a column in the beginning I took out. After performing `Store2[!!rowSums(!is.na(Store2)),]` a column called `Row.No` was created that only shows in the Rstudio environment. I want the same rows that were deleted from Store2df to be deleted in the original `RowNo` column. I put in my edit the expected result of what I want for the first 6 columns. – Scott Davis Sep 20 '14 at 19:52
  • I appreciate your help, but `Store2new2` gave a object with no integers. – Scott Davis Sep 20 '14 at 22:27
  • @Scott Davis it is not the case in the example I showed. You need to check `Store2New1` and `Store2df$RowNo` – akrun Sep 21 '14 at 03:48
4

Using the Store2 sample data posted in the answer provided by @akrun

which(apply(Store2, 1, function(x) all(is.na(x))))
#3 4 
#3 4 

Or, similar to akrun's answer:

which(rowSums(!is.na(Store2))==0)
#3 4 
#3 4 
Richie Cotton
  • 118,240
  • 47
  • 247
  • 360
Jota
  • 17,281
  • 7
  • 63
  • 93