
I need to remove the non-numeric values from my data frame, because I need only numeric values to compute quantiles, percentiles, etc. Below is my data.

dataL
 [ reached getOption("max.print") -- omitted 12892 entries ]
648 Levels: *Unknown* .P 001 111110 111199 111219 111310 111331 111335 111336 111339 111419 ... N/A

As you can see, there are character values like *Unknown*, .P, etc., and I need to remove them before computing percentiles and quantiles. This is what I did.

dataL[dataL == "NA" | dataL == "N/A" | dataL == "*Unknown*" | dataL == ".P" | dataL == "NULL"] <- NA
dataS <- na.omit(dataV)

But when I print dataS, it still shows the character value *Unknown*:

dataS

678 Levels: *Unknown* 0111 0116 0119 0139 0173 0174 0175 0179 0181 0182 0211 0212 0252 0711 ... 9999
ReKx
Bustergun
    Redo your data entry. Just coerce any numeric column with `colClasses` at the time of `read.table` or `read.csv`. – IRTFM Apr 09 '18 at 07:04

1 Answer


We can avoid this problem by specifying `na.strings` in `read.csv`/`read.table`, so that the placeholder values are read as NA from the start:

dataL <- read.csv("file.csv", stringsAsFactors = FALSE,
   na.strings = c("NA", "N/A", "*Unknown*", "NULL", ".P"))
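
As a minimal sketch of the `na.strings` approach, the snippet below reads a small inline CSV through the `text` argument instead of a file. The column name `code` and the sample values are made up for illustration and are not the asker's data.

```r
# Hypothetical inline data standing in for "file.csv"
csv_text <- "code
111110
*Unknown*
.P
111199
N/A
111219"

dataL <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  na.strings = c("NA", "N/A", "*Unknown*", "NULL", ".P"))

# The placeholder strings were read as NA, so the column is parsed as a number
str(dataL)

# Summary statistics can then skip the missing values directly
quantile(dataL$code, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
```

Note that `na.strings` entries are matched literally against whole fields, so each placeholder has to be spelled exactly as it appears in the file.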

The problem with the current approach is that these are factor columns, and replacing values with NA does not remove the corresponding levels: the unused levels are still printed. So we need `droplevels` to remove them:

dataS <- droplevels(na.omit(dataL))
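
To illustrate the unused-levels point, here is a small sketch on a made-up factor column (the values and the column name `code` are illustrative only). It also shows one common way to turn the remaining factor into numbers, going through `as.character`, since calling `as.numeric` directly on a factor returns the internal level codes rather than the labels.

```r
# Hypothetical factor column resembling the question's data
x <- factor(c("111110", "*Unknown*", ".P", "111199", "N/A", "111219"))
dataL <- data.frame(code = x)

# Replace the placeholder values with NA; the levels themselves remain
dataL[dataL == "N/A" | dataL == "*Unknown*" | dataL == ".P"] <- NA
levels(dataL$code)   # still lists "*Unknown*", ".P" and "N/A"

# Drop the NA rows, then drop the levels that are no longer used
dataS <- droplevels(na.omit(dataL))
levels(dataS$code)   # only the numeric-looking levels are left

# Convert via character so the labels, not the level codes, become numbers
dataS$code <- as.numeric(as.character(dataS$code))
quantile(dataS$code)
```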
akrun