0

The data set I have as file Dummy.txt is as follows

A|B|C|D
1|2|1.9|5
2.5|5|53|3
4|48|49|0.4
8|94|495|B6

(please note a text character in 5th row, 4th column)

I would like to obtain the mean of each column (i.e. column A, B, C and D).

The code I am using is as follows:

mydata_1 <- read.delim("Dummy.txt", skipNul = TRUE, sep = "|", header = FALSE, row.names = NULL)

mydata_1 <- as.numeric(as.character(mydata_1))

colMeans(mydata_1, na.rm = TRUE,)

However, this doesn't seem to be working. Any suggestions please?

Sharon
  • 19
  • 4
  • You have column headers, so `header = TRUE` would be better. Does that solve your problem? – KoenV May 09 '18 at 06:37
  • Tried that, still doesn't help sorry. – Sharon May 09 '18 at 06:38
  • and you have a non numeric entry in column 4, row 4: B6 – KoenV May 09 '18 at 06:38
  • Output of `as.numeric(as.character(mydata_1))` is all `NA`s. You cannot directly convert a dataframe to numeric. Use `lapply/sapply` , convert the column into numeric and then take sum. `sapply(mydata_1, function(x) sum(as.numeric(x), na.rm = T))` – Ronak Shah May 09 '18 at 06:39
  • 5th row, 4th column - edited (sorry again) – Sharon May 09 '18 at 06:39
  • @RonakShah: What is x here? Also, is there a comma after function(x)? – Sharon May 09 '18 at 06:41
  • @Sharon `x` is used as an anonymous variable which represents the column and there is no comma after `function(x)` . – Ronak Shah May 09 '18 at 06:43
  • @RonakShah - thank you. Just one last question. In case there are no column headers, x would be a numeric (such as 2 corresponding to 2nd column)? And for column average, I should use mean(as.numeric(x)...) ? – Sharon May 09 '18 at 06:45
  • @Sharon yes, using `mean(as.numeric(x), na.rm = TRUE)` should work in that case. – Ronak Shah May 09 '18 at 06:48
  • [First convert the columns to numeric](https://stackoverflow.com/questions/2288485/how-to-convert-a-data-frame-column-to-numeric-type) and then [take sum of columns](https://stackoverflow.com/questions/29006056/efficiently-sum-across-multiple-columns-in-r). – Ronak Shah May 09 '18 at 06:51
  • @RonakShah: For the fourth column, I am not getting the correct average here (due to the presence of a text character). – Sharon May 09 '18 at 06:53
  • To turn the problem upside down: you might have used the parameter `colClasses` from `read.delim` to indicate that your columns should be integers. This would make the conversion unnecesary. – KoenV May 09 '18 at 06:53

2 Answers2

7

You need to set header = TRUE to have the A|B|C|D row be used for column names, otherwise they are included as values, and all columns are parsed as string columns.

Then, passing stringsAsFactors = FALSE prevents columns D from being turned into a factor, and then the value 'B6' will automatically be turned into an NA when converted to a numeric type.

mydata_1 <- read.delim("Dummy.txt", skipNul = TRUE, sep = "|", header = TRUE, 
                       row.names = NULL, stringsAsFactors = FALSE)

mydata_1[] <- lapply(mydata_1, as.numeric)
#> Warning message:
#> In lapply(mydata_1, as.numeric) : NAs introduced by coercion

colMeans(mydata_1, na.rm = TRUE)
#>       A       B       C       D 
#>   3.875  37.250 149.725   2.800 

The syntax mydata_1[] <- ... makes mydata_1 keep its data frame structure even though a list is being returned on the right-hand side.

janusvm
  • 355
  • 2
  • 10
1

The problem here is that as.numeric(as.character(mydata_1)) returns [1] NA NA NA NA.

My suggestion would be to first go through all columns and coerce the types using sapply(), and then calculate the means of the columns:

library(magrittr)

mydata_1 %>% 
  sapply(., function(col) as.numeric(as.character(col))) %>% 
  colMeans(na.rm = TRUE)

This will return:

      A       B       C       D 
  3.875  37.250 149.725   2.800 

Note: I am using magrittr to make use of the pipe (%>%) operator to chain the operations so you can check the output of every step.

clemens
  • 6,653
  • 2
  • 19
  • 31