
I have read in some lengthy data with read.csv(), and to my surprise the data is coming out as factors rather than numbers, so I'm guessing there must be at least one non-numeric item in the data. How can I find where these items are?

For example, if I have the following data frame:

df <- data.frame(c(1,2,3,4,"five",6,7,8,"nine",10))

I would like to know that rows 5 and 9 have non-numeric data. How would I do that?

Ben Bolker
stackoverflowuser2010

2 Answers

df <- data.frame(x = c(1,2,3,4,"five",6,7,8,"nine",10))

The trick is knowing that converting to numeric via as.numeric(as.character(.)) turns non-numbers into NA (with a warning).

which(is.na(as.numeric(as.character(df[[1]]))))
## 5 9

(Just using as.numeric(df[[1]]) doesn't work when the column is a factor: it drops the levels and returns the underlying integer codes.)
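A small sketch of why the as.character() step matters when the column is a factor (as happens with read.csv() when stringsAsFactors = TRUE, the default before R 4.0.0). The vector `f` is made up for illustration and is not part of the original example.

```r
# Illustrative factor, mimicking a column that was read in as a factor
f <- factor(c("10", "20", "five", "20"))

# as.numeric() on a factor returns the internal integer codes, not the labels
codes <- as.numeric(f)

# Going through as.character() first recovers the labels; non-numbers become NA
values <- suppressWarnings(as.numeric(as.character(f)))

print(codes)   # 1 2 3 2
print(values)  # 10 20 NA 20
```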

You might choose to suppress the warnings:

which.nonnum <- function(x) {
   which(is.na(suppressWarnings(as.numeric(as.character(x)))))
}
which.nonnum(df[[1]])

To be more careful, you should also check that the values weren't NA before conversion:

which.nonnum <- function(x) {
   badNum <- is.na(suppressWarnings(as.numeric(as.character(x))))
   which(badNum & !is.na(x))
}

lapply(df, which.nonnum) will report 'bad' values for all columns of the data frame.
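As a rough sketch of the multi-column case, the helper can be applied to every column and the result flattened into (row, column) pairs. The data frame `df2` and the `locations` table are hypothetical names introduced here for illustration; column `b` contains a genuine NA in row 2, which should not be reported.

```r
# Helper from above: flags entries that fail numeric conversion
# but were not already missing
which.nonnum <- function(x) {
  badNum <- is.na(suppressWarnings(as.numeric(as.character(x))))
  which(badNum & !is.na(x))
}

# Hypothetical two-column data frame; b[2] is a real missing value
df2 <- data.frame(
  a = c("1", "two", "3", "4"),
  b = c("1.5", NA, "x", "-"),
  stringsAsFactors = FALSE
)

# Row indices of non-numeric entries, one list element per column
bad <- lapply(df2, which.nonnum)

# Flatten into a table with one (row, column) pair per bad entry
locations <- do.call(rbind, lapply(names(bad), function(nm) {
  data.frame(row = bad[[nm]],
             column = rep(nm, length(bad[[nm]])),
             stringsAsFactors = FALSE)
}))
print(locations)
```

Here `bad$a` is 2 and `bad$b` is 3 4; the NA in row 2 of `b` is skipped because it was missing before conversion.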

Ben Bolker
  • Why can't you use `is.numeric()`? – rrs Jan 18 '14 at 00:08
  • because `is.numeric()` applied to a factor simply drops the levels. – Ben Bolker Jan 18 '14 at 15:17
  • The answer helps me find inconsistent notations of missing values in source data. I have data where missing values were represented by "-" or "x". For example, `df <- data.frame(value = c(1,2,3,4,"-",6,7,8,"x",10)); df %>% filter(value %>% as.numeric() %>% is.na()) %>% count(value)` shows which non-numeric values occur in the column. – microbe Aug 10 '21 at 15:27
  • If the data frame has multiple columns, can this be adapted to detect non-numeric values and, if found, return their row and column locations? – Denis Cousineau Dec 30 '21 at 17:28
  • Sure (trivially, `lapply(df, which.nonnum)`). If you like you could ask that as a new question, linking back to this one. – Ben Bolker Dec 30 '21 at 17:44

An alternative is to check which entries in the vector do not match the pattern of a number:

df <- data.frame(c(1,2,3,4,"five",-6,7.1,-8.059,"nine",10))
which(!grepl('^-?(0|[1-9][0-9]*)(\\.[0-9]+)?$',df[[1]]))
## 5 9 
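The regex check and the conversion check agree on the example above, but they are not equivalent in general. The following sketch, using made-up inputs, compares the two on strings that as.numeric() accepts but this particular pattern rejects (scientific notation, a bare leading decimal point, surrounding whitespace, leading zeros).

```r
# Illustrative inputs: all but the last are accepted by as.numeric()
x <- c("10", "1e5", ".5", " 7 ", "007", "five")

# Pattern from the answer: optional minus, no leading zeros, optional decimals
pattern <- "^-?(0|[1-9][0-9]*)(\\.[0-9]+)?$"
rejected_by_regex <- which(!grepl(pattern, x))

# Conversion-based check, as in the other answer
rejected_by_conversion <- which(is.na(suppressWarnings(as.numeric(x))))

print(rejected_by_regex)       # 2 3 4 5 6
print(rejected_by_conversion)  # 6
```

Which behaviour is preferable depends on how strict the input format should be: the regex enforces one specific notation, while the conversion accepts anything R itself can parse as a number.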
Florian
  • This caret in the regex should be inside the character class. Also, you might consider changing the regex to `[^0-9.]` to allow for decimals. – Ashish Jun 11 '21 at 00:04
  • and `[^0-9.-]` for minus signs? (I'm not quite sure how regex distinguishes `-` as character from `-` as range separator) – Ben Bolker Aug 27 '22 at 20:00
  • Thanks @Ashish and Ben. I think checking for decimals and negative numbers is a good idea; however, the regex Ben proposed will also match invalid elements. So the answer now includes a regex string from https://stackoverflow.com/a/39399503/8037249 – Florian Aug 29 '22 at 07:46