
I am working with a data frame that has several NA values. I would like to remove the majority of them from the data frame. Is there a way to remove 90% of them or leave 10% by designating one of these percentages?

Thank you in advance for any help!

Joshua Mire
aped
  • Are you *sure* this is a good idea? I don't see why removing *all* of the NA entries would be bad, but removing *most* of them would be good. What purpose do the remaining 10% serve? Specifically, how do the remaining ones "maintain the integrity of the data that is already there?" And how do you want to decide which 90% of the NA values to excise? – Aaron Montgomery Jun 03 '20 at 16:47
  • Removing `NA` *by-column* suggests some confusion about the data. In a frame, each row is generally an "observation"; by removing `NA`s by column, you are unlikely to preserve the relationship of the nth observation in column 1 and the same nth observation in column 2. Further, if the number of `NA` in each column differs, you might attempt to remove more values from one column than another, which ... completely defeats the premise in `data.frame`s that each column is of identical length. It would help immensely to see sample data and your efforts up until now. – r2evans Jun 03 '20 at 17:40
  • Thank you for the feedback. How do you recommend I continue? The problem is that some of my columns are missing 80% of their values. Should I just fill everything with the mean? – aped Jun 03 '20 at 18:13
  • Well, why are the `NA` values a problem? What are they preventing you from doing? In general, the best thing to do with `NA` values is to figure out how to live with them, since replacing them with anything changes your data, perhaps drastically and in unhelpful ways. If you're running a function that's choking on the `NA` values, perhaps it has an `na.rm` parameter that causes it to skip the missing values, or something like that. But I am generally very wary of imputing data of any kind unless you're *really, really, really* sure you're on sound theoretical footing to do so. – Aaron Montgomery Jun 03 '20 at 19:00
  • If some fields are missing 80% of their data, then either (a) this is normal, and "missing" (`NA`) is clearly representative of something and therefore *meaningful* in its own regard; or (b) something is terribly wrong with the data acquisition, data storage, or data-munging processes. I don't really know for sure, though, without more context. (Sample data. Code attempted. Please see https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info.) – r2evans Jun 03 '20 at 19:08
  • Is it possible to preserve the current data while running a model and have it just skip over any observation with an `NA`? – aped Jun 03 '20 at 21:18
  • @aped Maybe, but we don't know what you're trying to do. (Would you like to explain?) At any rate, removing NA values or filling them in with means is probably not a good approach to solving that problem. – Aaron Montgomery Jun 03 '20 at 21:53
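
For what it's worth, the two ideas raised in the comments (an `na.rm` argument, and a model that skips incomplete observations while the data stay untouched) can be sketched in base R as below. This is only a minimal sketch: the data frame `df` and its columns `y`, `x1`, and `x2` are made up for illustration, since no sample data was posted, and the right approach may differ depending on the actual model.

```r
# Hypothetical example data; the column names are invented for illustration.
df <- data.frame(
  y  = c(1.2, 2.3, NA, 4.1, 5.0, 6.2),
  x1 = c(1, 2, 3, NA, 5, 6),
  x2 = c(NA, NA, NA, NA, 2.5, 3.5)
)

# Share of missing values in each column, to see how bad the problem is.
na_share <- colMeans(is.na(df))

# Many summary functions can skip missing values via na.rm.
mean_x1 <- mean(df$x1, na.rm = TRUE)

# lm() leaves out rows that are incomplete for the variables in the formula.
# na.omit is typically the default na.action; it is spelled out here for clarity.
fit <- lm(y ~ x1, data = df, na.action = na.omit)

# Number of observations the model actually used; df itself is unchanged.
n_used <- nobs(fit)
```

Only the rows that are complete for `y` and `x1` enter the fit, so the mostly missing `x2` column does not cost any observations unless it is added to the formula.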

0 Answers