1

I am combining several large datasets in R where missing values are denoted by ".". I want to do a bulk find-and-replace of "." with "NA" across the entire dataset (there are ~35 columns, and several hundred thousand rows). I've tried ifelse statements within individual columns, but the class of the column changes from factor to character in this process. When I convert back to factor the values have changed.

example data.frame:

SHARP_ID YEAR  CAL_DATE JUL_DAY ST_TIME OBS_INIT NOISE
23971_p7 2012 28-Jul-12     210     837   RP_CAW        1
23971_p7 2012  2-Jun-12     154     735   RP_CAW        4
23971_p5 2012 28-Jul-12     210     855   RP_CAW        1
23971_p10 2012 28-Jun-12     180    1012   RP_CAW        3
23971_p10 2012 28-Jul-12     210     813   RP_CAW        1
23971_p2 2012 28-Jun-12     180     856   RP_CAW        .
23971_p2 2012 28-Jun-12     180     856   RP_CAW        2
23971_p2 2012 28-Jul-12     210     921   RP_CAW        1
23971_p5 2012  2-Jun-12     154     753   RP_CAW        .
23971_p5 2012  2-Jun-12     154     753   RP_CAW        .

I have tried using ifelse, lapply, and gsub, however in all cases the mode of the column (in this example, NOISE) changes from factor to character. When I try to switch it back to factor, the values are different. for example:

> levels(d$NOISE)
[1] "." "0" "1" "2" "3" "4"
> class(d$NOISE)
[1] "factor"
> d$NOISE=ifelse(d$NOISE==".",as.factor("NA"),as.factor(d$NOISE))
> class(d$NOISE)
[1] "integer"
> d=RP12[,1:24]
> levels(d$NOISE)
[1] "." "0" "1" "2" "3" "4"
> class(d$NOISE)
[1] "factor"
> d$NOISE=ifelse(d$NOISE==".",as.factor("NA"),as.factor(d$NOISE))
> class(d$NOISE)
[1] "integer"
> d$NOISE=as.factor(d$NOISE)
> class(d$NOISE)
[1] "factor"
> levels(d$NOISE)
[1] "1" "2" "3" "4" "5" "6"

I need to do blanket find/replaces for a lot of values in this dataset, and most of the time they will be the equivalent of cell-specific find and replaces in Excel. These databases are all too big to be handled in Excel, so here I am. I am a newbie to data management in R, so please bear with me, help much appreciated.

Mo Correll
  • 57
  • 1
  • 8
  • 1
    One thing you really need to be aware of is the fact that you are using `"NA"` instead of `NA`. This lets R interpret it as a normal character vector of length 1 - _not_ as the special value for missing values. See for example the difference between `is.na("NA")` and `is.na(NA)`. Also note that you could crcumvent this problem by defining `na.strings = "."` in the call to `read.table` or `read.csv` when reading in your data. – talat Feb 03 '15 at 20:56
  • 2
    `read.*` the data in with `na.strings = c('.', 'NA', etc)` and you won't have this to deal with – rawr Feb 03 '15 at 20:58

1 Answers1

2

You're over-complicating things:

d[d == "."] <- NA

In case it's not clear, what happens here is that R checks whether every value in your dataframe is equal to ".". It then returns a dataframe of the same dimensions as your original, with a TRUE value everywhere the condition is met and False where it is not. You then use this dataframe to indicate which values of the original should be replaced accessed, in this case to be replaced by NA.

This should work regardless of the class; both factor and character vectors will return TRUE and be modified where appropriate, while other classes will return FALSE and be passed over.

Edit: reproducible demonstration

> d <- data.frame(
+   a = c ("A", "B", "."),
+   b = as.character(c("A", "B", ".")),
+   c = c(1,2,3)
+ )
> d
  a b c
1 A A 1
2 B B 2
3 . . 3
> d[d == "."] <- NA
> d
     a    b c
1    A    A 1
2    B    B 2
3 <NA> <NA> 3
Joe
  • 3,831
  • 4
  • 28
  • 44
  • I tried this on my dataset, and it runs with no warnings or errors, however the "." entries are still there, none seem to have been replaced. The same is true if I try it on just one column. – Mo Correll Feb 03 '15 at 21:35
  • @Mo I don't know why that would be without a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). I've just added a brief demonstration to my answer which shows that it works for me on a 'typical' dataframe; your problem is likely related to some peculiarity of your data or code, so I can't help without those. – Joe Feb 03 '15 at 21:57
  • Joe - Ive tried making a data frame for you, but every time I subset my data (I've tried 2 separate databases) down to a size that would be appropriate to pass to you, the code you provided works to replace the erroneous values. – Mo Correll Feb 04 '15 at 14:25
  • Interesting, I'm not sure why that would be the case. If you can upload the entire dataset somewhere maybe I can have a look, though I have no idea what could be going on and it may be beyond my abilities to solve. As a somewhat hacky solution, if sub-setting the df fixes the problem, can you just split it into several pieces for this step? – Joe Feb 04 '15 at 17:29
  • It started working when I'd eliminated a lot of covariates. Now I think I've got it working on the whole dataset, but the "levels" function is still reading that the replaced values are still in there, even when I query the rows with the offending value, 0 rows are returned. I'm at a loss. – Mo Correll Feb 06 '15 at 18:36
  • I really don't think I can help you with the information I have, and a comment thread is not to troubleshoot new issues. If you think this all falls under the purview of your original question then you should edit it to include the new info (and a reproducible example)- this will bump it to the front page so you might get new attention. If it's evolved beyond your original question, you should ask a new one. Good luck! – Joe Feb 06 '15 at 19:33