I am trying to add several columns together with rowSums, but I am running into an error. Here is the list of column names:

colnames(No_Low_No_Intergenic_snpeff)

"CHROM" "POS"   "REF"   "ALT"   "QUAL"  "ANN.ALLELE"    "ANN.EFFECT"
"ANN.IMPACT"    "ANN.GENE"  "ANN.GENEID"    "ANN.FEATURE"   "ANN.FEATUREID"
"ANN.HGVS_C"    "ANN.HGVS_P"    "ANN.ERRORS"    "GEN.C02141.GT" "GEN.C00611.GT"
"GEN.C00633.GT" "GEN.C00634.GT" "GEN.C00644.GT" "GEN.C00647.GT" "GEN.C00648.GT"
"GEN.C00649.GT" "GEN.C00650.GT" "GEN.C00653.GT" "GEN.C00655.GT" "GEN.C00656.GT"
"GEN.C00657.GT" "GEN.C00659.GT" "GEN.C00682.GT" "GEN.C00705.GT" "GEN.C00707.GT"
"GEN.C00720.GT" "GEN.C00783.GT" "GEN.C01431.GT" "GEN.C01944.GT" "GEN.C01943.GT"
"GEN.C01403.GT" "GEN.C01158.GT" "GEN.C01157.GT" "GEN.C01156.GT" "GEN.C01033.GT"
"GEN.C00736.GT" "GEN.C00639.GT" "GEN.C99686.GT"

All of the columns that I am working with are labeled GEN.Cxxxxx.GT, and all the values in those columns range from 0 to 2. I am trying to sum columns 20:29 and column 45 and then put the result in a new column called controls:

No_Low_No_Intergenic_snpeff.scores$controls <- rowSums(No_Low_No_Intergenic_snpeff.scores[,20:29,45])

but when I try running that command I get the following error:

Error in rowSums(No_Low_No_Intergenic_snpeff.scores[, 20:29, 45]) : 'x' must be numeric

Data

str(No_Low_No_Intergenic_snpeff.scores)

'data.frame':   1000 obs. of 11 variables:
$ GEN.C00644.GT: Factor w/ 3 levels "0","1","2": 3 1 1 3 3 3 2 1 3 1 ...
$ GEN.C00647.GT: Factor w/ 3 levels "0","1","2": 3 1 3 3 2 2 2 1 2 1 ...
$ GEN.C00648.GT: Factor w/ 3 levels "0","1","2": 3 1 1 3 3 3 1 1 2 1 ...
$ GEN.C00649.GT: Factor w/ 3 levels "0","1","2": 3 1 1 3 2 2 2 1 2 1 ...
...
  • You have an error: you need to wrap `20:29, 45` in `c()`, i.e. `rowSums(No_Low_No_Intergenic_snpeff.scores[,c(20:29,45)])` – emilliman5 Jun 25 '18 at 18:30
  • That didn't do it either `No_Low_No_Intergenic_snpeff.scores$controls <- rowSums(No_Low_No_Intergenic_snpeff.scores[,c(20:29,45)])` `Error in rowSums(No_Low_No_Intergenic_snpeff.scores[, c(20:29, 45)]) : 'x' must be numeric` – neuron Jun 25 '18 at 18:32
  • Are you sure that all the values in those columns are `numeric` as opposed to `factor` or `character` values containing numbers? Providing the output of `str(No_Low_No_Intergenic_snpeff.scores[, c(20:29, 45)])` might help – divibisan Jun 25 '18 at 18:40
  • Please post a sample of your actual data, not just the names of columns. Otherwise we're just guessing as to what the typing issue is. But as @emilliman5 said, proper indexing in R takes the form `data[, c(1, 3, 5:7)]`, not `data[, 1, 3, 5:7]` – camille Jun 25 '18 at 18:42
  • @Brian I've added your `str` output to the question. In the future, additional information should be added as edits to the question, not as comments. For posting data, use the `dput` function. You should read this page to find out how to make a great question that gets fast and good answers: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – divibisan Jun 25 '18 at 18:58

1 Answer

You're getting this error because the values are not numeric. Look at your output from str:

GEN.C00650.GT: Factor w/ 3 levels "0","1","2": 3 1 3 3 3 3 1 1 3 1 ... 

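These are class factor, not class numeric. To work with them as numbers, you need to convert them with as.numeric(as.character(x)). Calling as.numeric directly on a factor returns the internal level codes (1, 2, 3 here) rather than the values 0, 1, 2.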
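To see why the conversion has to go through as.character, here is a minimal sketch on a made-up factor (the vector `gt` is illustrative and not taken from the question's data):

```r
# Toy factor mimicking one genotype column (values are made up)
gt <- factor(c("2", "0", "0", "2", "1"), levels = c("0", "1", "2"))

# as.numeric() alone returns the 1-based internal level codes, not the labels
codes <- as.numeric(gt)

# Converting to character first recovers the original values
values <- as.numeric(as.character(gt))

print(codes)   # [1] 3 1 1 3 2
print(values)  # [1] 2 0 0 2 1
```

Summing `codes` instead of `values` would silently inflate every row total by one per column, which is why the extra as.character step matters.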

If you can import your data again:

If you can import your data from the file again, do so with the stringsAsFactors = FALSE argument. You should almost always use this argument, since without it all strings (and, as you see here, numbers that were read in as strings) are converted into factors, creating all kinds of annoying problems until you change them back.

As of R 4.0.0, this is no longer necessary, because the default value of stringsAsFactors was changed to FALSE. This will hopefully make this common mistake a thing of the past.
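As a rough illustration of what the argument changes, the sketch below reads the same made-up CSV text twice (the `csv_text` contents and column names are hypothetical, standing in for a real file):

```r
# Made-up CSV text standing in for a real file on disk
csv_text <- "sample,call\nC00644,het\nC00647,hom\n"

# Explicitly request factors (the default behaviour before R 4.0.0)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Keep strings as plain character vectors (the default from R 4.0.0 on)
as_char <- read.csv(text = csv_text, stringsAsFactors = FALSE)

print(class(as_factor$call))  # [1] "factor"
print(class(as_char$call))    # [1] "character"
```

Setting the argument explicitly keeps a script behaving the same way on R versions before and after 4.0.0.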

Otherwise, to change from a Factor back to a Number:

Base R

The simplest way to do this is to use sapply:

rowSums(sapply(No_Low_No_Intergenic_snpeff.scores[, c(20:29, 45)],
               function(x) as.numeric(as.character(x))))

This subsets your data.frame, converts each selected column to character and then to numeric, and then calculates rowSums on the resulting numeric matrix.
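The same pattern can be tried end to end on a small stand-in data frame. This is only a sketch: the data frame `scores`, its column names, and its values are made up, and columns 1 and 3 play the role of `c(20:29, 45)`:

```r
# Toy data frame: three factor columns with levels "0", "1", "2" (made-up values)
lv <- c("0", "1", "2")
scores <- data.frame(
  g1 = factor(c("2", "0", "1"), levels = lv),
  g2 = factor(c("1", "0", "2"), levels = lv),
  g3 = factor(c("2", "2", "0"), levels = lv)
)

# Select columns by position; note the c() wrapped around the indices
cols <- c(1, 3)

# Convert each selected column factor -> character -> numeric, then sum per row
scores$controls <- rowSums(sapply(scores[, cols],
                                  function(x) as.numeric(as.character(x))))

print(scores$controls)  # [1] 4 2 1
```

Only the new `controls` column is numeric afterwards; the original genotype columns stay as factors because the conversion happens on a temporary copy.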

tidyverse

You can also use the mutate_if function from dplyr to convert all factor variables to numeric.

library(dplyr)

No_Low_No_Intergenic_snpeff.scores <- No_Low_No_Intergenic_snpeff.scores %>%
    mutate_if(is.factor, ~as.numeric(as.character(.)))

rowSums(No_Low_No_Intergenic_snpeff.scores[, c(20:29, 45)])

Alternatively, you could use mutate_at to select columns by position or name. Read ?select to see all the different ways you can select columns. You can even use a regular expression with matches, as below:

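No_Low_No_Intergenic_snpeff.scores <- No_Low_No_Intergenic_snpeff.scores %>%
    mutate_at(vars(matches('GEN.C\\d{5}.GT')), ~as.numeric(as.character(.)))

This converts all columns whose names match the regular expression GEN.C\\d{5}.GT, where \\d{5} represents 5 numeric digits. The conversion goes through as.character so that the original values 0-2 are kept rather than the factor level codes, and it uses the ~ formula shorthand because funs() is deprecated in current versions of dplyr.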
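If you would rather not depend on dplyr, the same "convert only the matching columns" idea can be sketched in base R. This is an alternative to, not a description of, the mutate_at call; the data frame `df` and its values are made up, and the pattern uses `[0-9]` with escaped dots as a stricter variant of the regular expression:

```r
# Toy data frame with one annotation column and two genotype factor columns
lv <- c("0", "1", "2")
df <- data.frame(
  CHROM = c("chr1", "chr2"),
  GEN.C00644.GT = factor(c("2", "0"), levels = lv),
  GEN.C00647.GT = factor(c("1", "2"), levels = lv),
  stringsAsFactors = FALSE
)

# Logical mask of the columns whose names look like GEN.Cxxxxx.GT
gt_cols <- grepl("^GEN\\.C[0-9]{5}\\.GT$", names(df))

# Replace only those columns, going factor -> character -> numeric
df[gt_cols] <- lapply(df[gt_cols], function(x) as.numeric(as.character(x)))

print(sapply(df, class))
```

Columns that do not match the pattern, such as `CHROM` here, are left untouched.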

  • hmmm this makes a lot of sense. Let me give a try real quick – neuron Jun 25 '18 at 18:57
  • Is there a way to have dplyr change only selected columns? Right now it is changing all my data to numeric – neuron Jun 25 '18 at 19:09
  • @Brian yes, either `mutate_at` to specify columns by name or position, or `mutate_if` to specify columns based on a predicate, as shown above – camille Jun 25 '18 at 19:14