I have a data.frame with 15,000 observations of 34 ordinal variables that include NA values. I am performing clustering for a market segmentation study and need to remove the rows that contain only NAs. After taking out the userID column, I got an error message telling me to omit 2099 rows with only NAs before clustering.
I found a discussion on removing rows with all NA values (Remove Rows with NAs in data.frame), but I need to identify which of the 2099 rows have all NA values.
Here's a sample of the first five observations from six variables:
> head(Store2df, n=5)
RowNo Age Gender HouseholdIncome MaritalStatus PresenceofChildren
1 1 <NA> Male <NA> <NA> <NA>
2 2 45-54 Female <NA> <NA> <NA>
3 3 <NA> <NA> <NA> <NA> <NA>
4 4 <NA> <NA> <NA> <NA> <NA>
5 5 45-54 Female 75k-100k Married Yes
#Making a vector
> Vector1 <- Store2df$RowNo
#Taking out RowNo column
> Store2df$RowNo <- NULL
EDIT: I put the results in an object, but found that the code created an extra column. In RStudio's environment viewer, an extra column called row.names appears, labeling each row with its original row name. A couple thousand rows were deleted, and the new column labels the remaining rows with their old row numbers. However, when I look at the head of the new object, I do not see a row.names column. Why does the row.names label show in the environment, but not when I view the head?
#Remove all rows with only NA values
> Store2df <- Store2[!!rowSums(!is.na(Store2)),]
#View head of store2df
> head(Store2df)
Age Gender HouseholdIncome MaritalStatus PresenceofChildren
1 <NA> Male <NA> <NA> <NA>
2 45-54 Female <NA> <NA> <NA>
5 45-54 Female 75k-100k Married Yes
6 25-34 Male 75k-100k Married No
7 35-44 Female 125k-150k Married Yes
8 55-64 Male 75k-100k Married No
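A small sketch of what seems to be happening with row.names, using a made-up data frame (the `df`, `kept`, and `kept_reset` names are hypothetical, not from the original data). Subsetting a data.frame keeps the original row names, and `head()` prints them as the unlabeled left-most column (the 1, 2, 5, 6, ... above), which may be what a viewer presents under a row.names heading. Resetting them with `rownames(x) <- NULL` renumbers the rows from 1.

```r
# Made-up data: row 3 is entirely NA.
df <- data.frame(
  Age    = c(NA, "45-54", NA, "25-34"),
  Gender = c("Male", NA, NA, "Male"),
  stringsAsFactors = FALSE
)

# Keep rows that have at least one non-NA value.
kept <- df[rowSums(!is.na(df)) > 0, ]

# The original row labels survive the subset; row 3 is simply missing.
print(rownames(kept))

# Optional: drop the old labels so rows are renumbered 1..n.
kept_reset <- kept
rownames(kept_reset) <- NULL
print(rownames(kept_reset))
```

Whether to reset the row names depends on whether the old positions are still needed to trace rows back to the original data.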
EDIT 2: I put in the row number/userID column to keep track of the number of users. To remove the rows with all NAs, I took out the first column. Now I need to keep track of the users I removed. I have a list of over 2000 rows that had all NA values, and I don't want to create an index by manually entering each row.
Question: How do I remove the emails that the missing data corresponded to?
> #First six rows of the column RowNo
> head(Store2df$RowNo)
[1] 1 2 3 4 5 6
I want 2099 rows deleted in the Store2df data.frame with the RowNo included. Here's the script identifying which rows are all empty in the Store2df data.frame without RowNo.
> which(rowSums(is.na(Store2df))==ncol(Store2df))
In the first six rows shown, row numbers 3 and 4 have been deleted.
> head(Store2df$RowNo)
[1] 1 2 5 6 7 8
There are 4 steps I want to complete:
1) Take out RowNo column in Store2df data.frame and save as separate vector
2) Delete rows with all NA values in Store2df data.frame
3) Delete same rows in Store2new1 vector as Store2df data.frame
4) Combine vector and data.frame with vector matching the data.frame
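The four steps could be sketched roughly as follows, on a toy stand-in for Store2df. The column names follow the question, but the values are made up, and `row_ids`, `vars`, `all_na`, `removed_ids`, and `cleaned` are hypothetical names chosen for illustration. The idea is to compute one logical index of all-NA rows and apply it to both the vector and the data.frame, so no manual index is needed.

```r
# Toy stand-in for Store2df (values are invented for the example).
Store2df <- data.frame(
  RowNo  = 1:5,
  Age    = c(NA, "45-54", NA, NA, "45-54"),
  Gender = c("Male", "Female", NA, NA, "Female"),
  stringsAsFactors = FALSE
)

# 1) Save RowNo as a separate vector and drop it from the data.frame.
row_ids <- Store2df$RowNo
vars <- Store2df[, setdiff(names(Store2df), "RowNo"), drop = FALSE]

# 2) Flag the rows in which every remaining column is NA.
all_na <- rowSums(is.na(vars)) == ncol(vars)

# The IDs of the rows about to be removed, for record keeping.
removed_ids <- row_ids[all_na]

# 3) and 4) Apply the same index to the vector and the data.frame,
# then recombine them so RowNo still lines up with its row.
cleaned <- data.frame(
  RowNo = row_ids[!all_na],
  vars[!all_na, , drop = FALSE]
)
rownames(cleaned) <- NULL  # optional: renumber the printed row labels

print(removed_ids)
print(cleaned)
```

An alternative under the same assumptions is to leave RowNo in the data.frame, compute `all_na` on the other columns only, and subset the whole data.frame once with `Store2df[!all_na, ]`.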