
I'm new to R, so I don't know all its ins and outs yet, but I think this code should work. It reads a matrix of people with heights, weights, and income levels from a .csv file, then removes any row that contains an NA or a value outside a given range (e.g. 4.5 to 6.5 for height). When I run the script, some people are removed, but there are still rows with NA values or out-of-range values. I can't tell whether it's removing only some of the people who don't fit, removing the wrong people entirely, or both.

original = read.csv("C:/Users/gsbal/OneDrive/Documents/Quants R Course/HW/A2-C-DirtyData.csv")

nums = 1:nrow(original)
toDelete = 0
deleted = 0
for (i in nums)
{
  na = is.na(original[i, 1]) | is.na(original[i, 2]) | is.na(original[i, 3])
  if (na == T)
  {
    toDelete = i - deleted
    original = original[-toDelete,]
    deleted = deleted + 1
  }
}

nums = 1:nrow(original)
toDelete = 0
deleted = 0
for (i in nums)
{
  height = original[i, 1] < 4.5 | original[i, 1] > 6.5
  if (height == T)
  {
    toDelete = i - deleted
    original = original[-toDelete,]
    deleted = deleted + 1
  }
}
Albert Ai
  • Welcome to Stack Overflow! Help us help you: Provide a [mcve]. In particular, we don't have access to your data. You can [edit] your question to include the output of the R command `dput(original)` to provide your data in a format that is easy for others to put into their R session. (You can use `dput(head(original, n))` for some `n` to provide a manageable subset if `original` is a large dataset). – duckmayr May 30 '20 at 15:22
  • That said, an overall comment I'd give you is to just go ahead and work with logical vectors rather than looping one by one. Loops get a bad rap in R, which is often unwarranted, but here I'm suggesting avoiding the loop so that you can avoid having to try to change your indexing to account for rows that you've already removed. Working with vectors, you can just discover all the bad rows, then remove them all at once. – duckmayr May 30 '20 at 15:24
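The logical-vector approach suggested in the comments can be sketched in base R like this (a sketch only, since the data isn't available; the column positions are assumed to match the question's indexing, with height in column 1):

```r
# Build one logical vector marking the rows to keep:
# no NA in any column, and height within [4.5, 6.5].
keep <- complete.cases(original) &
        original[[1]] >= 4.5 & original[[1]] <= 6.5

# Remove all bad rows at once -- no index bookkeeping needed.
cleaned <- original[keep, ]
```

Because `complete.cases()` is `FALSE` for any row containing an NA, the combined condition is `FALSE` for those rows even though the range comparisons themselves evaluate to NA there.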

1 Answer


One of the great strengths of R is that it works on columns of data, rather than individual values. So you should be able to do all that you want to do without resorting to loops. I recommend tidyverse as a package that will almost always be helpful. It is, in the words of its developers, "opinionated" - and it is - but it's also very good.

Unfortunately, you haven't given us a simple self-contained example, so I can't test my code, but something like this should remove any row of your data with an NA in any column. [See this post if you'd like to know more about simple self-contained examples and reprexes.]

library(tidyverse)

modified <- original %>% drop_na()
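To see what `drop_na()` does, here is a tiny made-up dataset (the column names here are invented for illustration, not taken from your file):

```r
library(tidyverse)

d <- tibble(height = c(5.5, NA, 6.0),
            weight = c(150, 160, NA))

d %>% drop_na()
# Only the first row survives: the other two each contain an NA.
```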

How simple was that? Your range checks work the same way. Note that `filter()` *keeps* the rows for which the condition is `TRUE`, so to retain only heights within the range, the condition should describe the rows you want to keep:

modifiedAgain <- modified %>% filter(height >= 4.5 & height <= 6.5)
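For completeness, the same two steps can be done in base R without any packages (again untested against your data; this assumes the height column is named `height`, as in the `filter()` call above):

```r
# Step 1: drop rows with an NA in any column.
modifiedBase <- original[complete.cases(original), ]

# Step 2: keep only rows whose height lies within the allowed range.
modifiedBase <- modifiedBase[modifiedBase$height >= 4.5 &
                             modifiedBase$height <= 6.5, ]
```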
Limey