1

I am working with dc.forehead, which has 1014 rows and 33 columns. The rows are in triplicate, however, and I need to find the column mean for each set of three while keeping their categories.

    ID      Ancestry  X400   X410   X420   etc.
    140819  African   4.550  4.590  4.710
    140819  African   4.310  4.290  4.440
    140819  African   4.420  4.490  4.690
    140822  African   4.190  4.040  3.630
    140822  African   3.591  3.360  3.860
    140822  African   3.890  3.860  3.420
    140844  S. Asian  4.140  3.290  3.880
    140844  S. Asian  3.370  3.720  4.150
    140844  S. Asian  3.260  4.080  3.960
    etc.

The result I would like would be like this:

    ID      Ancestry  X400  X410  X420 etc.
    140819  African   4.43  4.46  4.61
    140822  African   3.89  3.75  3.63
    140844  S. Asian  3.59  3.70  4.00 
    etc.

I've tried the following, but I lost the categories and it deleted one of the ID's and replaced it with #REF

    aggregate(dc.forehead[,3:33], by = list(dc.forehead$ID), FUN = mean)

Any assistance would be really great: I'm new to R and in way over my head!

Thanks

J. White
  • 13
  • 2
  • I used your method on the data showed. I couldn't get `#REF`, but it removed the Ancestry column as it was not in the grouping variable. You need to show an example that create the problem rather than one that works perfectly. – akrun Sep 06 '15 at 03:22
  • 2
    R would *never* return `#REF` by default- it isn't Excel. You've read them from Excel and just didn't notice them because data was too big. You should first check your excel file and see why are you getting those `#REF`s and if you are not losing valuable information. Then, check @akruns answer on how to read the data correctly. Regarding losing `ID`s, it isn't very clear and could happen because of `NA`s present in the data. You should probably provide an example data when `ID`s are being lost after aggregation in order for us to help you (or try the suggestions in akruns answer). – David Arenburg Sep 06 '15 at 06:34
  • 1
    As @akrun said: it's important to give example data that recreates the problem. For a guideline on how to do that, you can read the FAQ about how to produce a [minimal reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) – Jaap Sep 06 '15 at 06:57

2 Answers2

2

You can use the formula method of aggregate from base R. Use 'ID' and 'Ancestry' as the grouping variable. One reason why you may have lost an 'ID' would be that the dataset may have some NA elements in any column for that particular 'ID'. By default, the na.action=na.omit, so it will remove the entire row. We can change the default to na.action=NULL and it should work now.

aggregate(.~ ID + Ancestry, dc.forehead, FUN=mean, na.rm=TRUE, na.action=NULL)
#     ID Ancestry     X400     X410     X420
#1 140819  African 4.426667 4.456667 4.613333
#2 140822  African 3.890333 3.753333 3.636667
#3 140844 S. Asian 3.590000 3.696667 3.996667

Or we can use data.table. We convert the 'data.frame' to 'data.table' (setDT(dc.forehead)), group by 'ID', and 'Ancestry', loop (lapply(.SD, ..) through the columns to get the mean.

library(data.table)
setDT(dc.forehead)[, lapply(.SD, mean, na.rm=TRUE), .(ID, Ancestry)]
#       ID Ancestry     X400     X410     X420
#1: 140819  African 4.426667 4.456667 4.613333
#2: 140822  African 3.890333 3.753333 3.636667
#3: 140844 S. Asian 3.590000 3.696667 3.996667

EDIT: But, it is still not clear (based on the example) the reason you got #REF. It could be that the original excel sheet may have some elements as #REF. One way to rectify this is to read the data using na.strings=#REF to return those elements as NA i.e.

dc.forehead <- read.csv('yourfile.csv', na.strings='#REF', 
                        stringsAsFactors=FALSE)

data

dc.forehead <- structure(list(ID = c(140819L, 140819L, 140819L, 140822L, 
140822L, 
140822L, 140844L, 140844L, 140844L), Ancestry = c("African", 
"African", "African", "African", "African", "African", "S. Asian", 
"S. Asian", "S. Asian"), X400 = c(4.55, 4.31, 4.42, 4.19, 3.591, 
3.89, 4.14, 3.37, 3.26), X410 = c(4.59, 4.29, 4.49, 4.04, 3.36, 
3.86, 3.29, 3.72, 4.08), X420 = c(4.71, 4.44, 4.69, 3.63, 3.86, 
3.42, 3.88, 4.15, 3.96)), .Names = c("ID", "Ancestry", "X400", 
"X410", "X420"), class = "data.frame", row.names = c(NA, -9L))
akrun
  • 874,273
  • 37
  • 540
  • 662
1

Here's a method using the dplyr package.

library(dplyr)

dc.forehead %>% group_by(ID, Ancestry) %>%
  summarise_each(funs(mean)) 

      ID Ancestry     X400     X410     X420
1 140819  African 4.426667 4.456667 4.613333
2 140822  African 3.890333 3.753333 3.636667
3 140844 S. Asian 3.590000 3.696667 3.996667
eipi10
  • 91,525
  • 24
  • 209
  • 285