
I would like to aggregate a table (tab) by two columns, sequence (sq) and program (prog), and keep the first sample size (ss) in each group (FUN=head).

sq <- c(1,1,1,1,1,1)
prog <- c('A','A','B','B','C','C')
ss <- c(47,47,28,28,47,47)
tab <- data.frame(sq,prog,ss)

aggregate() is giving me an odd result: when the sample size is the same for a DIFFERENT combination of sequence and program, it omits that combination.

agg  <- aggregate(cbind(sq,prog) ~ ss, data = tab, FUN=head,1,na.rm=TRUE)

I'm confused about why this is occurring, and why the program column is being changed to a numerical sequence when it is text (A, B, C).

user9084595
  • Your aggregation is wrong: `agg <- aggregate(ss ~ sq + prog, data = tab, FUN=head,1,na.rm=TRUE)` – CPak Feb 02 '18 at 19:22

1 Answer


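The numbers appear because, by default (in R versions before 4.0.0), data.frame() converts character columns to factors, and cbind() then uses the factor's integer level codes instead of the labels. You need: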

tab <- data.frame(sq, prog, ss, stringsAsFactors = FALSE)
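To see the factor effect in isolation, here is a minimal sketch using the question's data. The names tab_f, tab_c, m_f and m_c are made up for this illustration, and stringsAsFactors is set explicitly so the behaviour does not depend on the R version:

```r
sq   <- c(1, 1, 1, 1, 1, 1)
prog <- c('A', 'A', 'B', 'B', 'C', 'C')
ss   <- c(47, 47, 28, 28, 47, 47)

# Same data twice: once with prog as a factor, once as plain character
tab_f <- data.frame(sq, prog, ss, stringsAsFactors = TRUE)
tab_c <- data.frame(sq, prog, ss, stringsAsFactors = FALSE)

# cbind() on a factor uses its integer level codes, so the matrix is numeric
m_f <- with(tab_f, cbind(sq, prog))

# cbind() on a character vector gives a character matrix, so the labels survive
m_c <- with(tab_c, cbind(sq, prog))

is.numeric(m_f)
is.character(m_c)
```

Since aggregate() evaluates cbind(sq, prog) on the data frame's columns, the factor version is where the 1, 2, 3 codes in the output come from.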

EDIT: I personally find the dplyr package very intuitive. For your result, I'd use:

library(dplyr)
tab %>%
  group_by(sq, prog) %>% 
  filter(row_number() == 1)
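For comparison, a base R sketch of the same result (one possible approach, not the only one). In the question's formula, ss sits on the right-hand side, so aggregate() groups by sample size and returns one row per distinct ss value, which is why combinations sharing a sample size seem to disappear. Putting sq and prog on the right-hand side groups by them instead. The name first_rows is made up for this illustration:

```r
sq   <- c(1, 1, 1, 1, 1, 1)
prog <- c('A', 'A', 'B', 'B', 'C', 'C')
ss   <- c(47, 47, 28, 28, 47, 47)
tab  <- data.frame(sq, prog, ss, stringsAsFactors = FALSE)

# Grouping columns go on the right of ~, the summarised column on the left
agg <- aggregate(ss ~ sq + prog, data = tab, FUN = function(x) head(x, 1))

# Alternative: keep the first row of every (sq, prog) combination
first_rows <- tab[!duplicated(tab[c("sq", "prog")]), ]
```

Both agg and first_rows should contain one row per sq/prog combination, so A and C are both kept even though they share a sample size of 47.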
Constantinos
  • Well, this answers why it is giving me numbers instead of letters in the aggregated table. However, the main problem still persists: it omits the records where the sample size is the same for a different combination of sequence and program. – user9084595 Feb 02 '18 at 18:54
  • Ok, I hadn't realised. I've now edited my answer accordingly. – Constantinos Feb 02 '18 at 19:13