colMeans not working in R

Question

The data set I have as file Dummy.txt is as follows

A|B|C|D
1|2|1.9|5
2.5|5|53|3
4|48|49|0.4
8|94|495|B6

(please note a text character in 5th row, 4th column)

I would like to obtain the mean of each column (i.e. column A, B, C and D).

The code I am using is as follows:

mydata_1 <- read.delim("Dummy.txt", skipNul = TRUE, sep = "|", header = FALSE, row.names = NULL)

mydata_1 <- as.numeric(as.character(mydata_1))

colMeans(mydata_1, na.rm = TRUE,)

However, this doesn't seem to be working. Any suggestions please?

You have column headers, so `header = TRUE` would be better. Does that solve your problem? — KoenV, May 09 '18 at 06:37
Output of `as.numeric(as.character(mydata_1))` is all `NA`s. You cannot directly convert a dataframe to numeric. Use `lapply/sapply` , convert the column into numeric and then take sum. `sapply(mydata_1, function(x) sum(as.numeric(x), na.rm = T))` — Ronak Shah, May 09 '18 at 06:39
@RonakShah: What is x here? Also, is there a comma after function(x)? — Sharon, May 09 '18 at 06:41
@Sharon `x` is used as an anonymous variable which represents the column and there is no comma after `function(x)` . — Ronak Shah, May 09 '18 at 06:43
@RonakShah - thank you. Just one last question. In case there are no column headers, x would be a numeric (such as 2 corresponding to 2nd column)? And for column average, I should use mean(as.numeric(x)...) ? — Sharon, May 09 '18 at 06:45
@Sharon yes, using `mean(as.numeric(x), na.rm = TRUE)` should work in that case. — Ronak Shah, May 09 '18 at 06:48
[First convert the columns to numeric](https://stackoverflow.com/questions/2288485/how-to-convert-a-data-frame-column-to-numeric-type) and then [take sum of columns](https://stackoverflow.com/questions/29006056/efficiently-sum-across-multiple-columns-in-r). — Ronak Shah, May 09 '18 at 06:51
@RonakShah: For the fourth column, I am not getting the correct average here (due to the presence of a text character). — Sharon, May 09 '18 at 06:53
To turn the problem upside down: you might have used the parameter `colClasses` from `read.delim` to indicate that your columns should be integers. This would make the conversion unnecesary. — KoenV, May 09 '18 at 06:53

score 7 · Answer 1 · answered May 09 '18 at 06:56

You need to set header = TRUE to have the A|B|C|D row be used for column names, otherwise they are included as values, and all columns are parsed as string columns.

Then, passing stringsAsFactors = FALSE prevents columns D from being turned into a factor, and then the value 'B6' will automatically be turned into an NA when converted to a numeric type.

mydata_1 <- read.delim("Dummy.txt", skipNul = TRUE, sep = "|", header = TRUE, 
                       row.names = NULL, stringsAsFactors = FALSE)

mydata_1[] <- lapply(mydata_1, as.numeric)
#> Warning message:
#> In lapply(mydata_1, as.numeric) : NAs introduced by coercion

colMeans(mydata_1, na.rm = TRUE)
#>       A       B       C       D 
#>   3.875  37.250 149.725   2.800

The syntax mydata_1[] <- ... makes mydata_1 keep its data frame structure even though a list is being returned on the right-hand side.

You've summarized all my thoughts exactly. This is definitely the correct approach IMO — David Arenburg, May 09 '18 at 06:57

clemens · Accepted Answer · 2018-05-09T08:24:05.570

1

The problem here is that as.numeric(as.character(mydata_1)) returns [1] NA NA NA NA.

My suggestion would be to first go through all columns and coerce the types using sapply(), and then calculate the means of the columns:

library(magrittr)

mydata_1 %>% 
  sapply(., function(col) as.numeric(as.character(col))) %>% 
  colMeans(na.rm = TRUE)

This will return:

      A       B       C       D 
  3.875  37.250 149.725   2.800

Note: I am using magrittr to make use of the pipe (%>%) operator to chain the operations so you can check the output of every step.

edited May 09 '18 at 08:24

answered May 09 '18 at 06:45

clemens

6,653
2
19
31

hey, why not `sapply` and leave out the `do.call` part? – Andre Elrico May 09 '18 at 08:16

colMeans not working in R

2 Answers2