1

I have a CSV file which contains cancer data for two study groups: A and A Follow-up (e.g., before and after treatment). The data are presented as follows:

ID     Ethnicity    Study Group
45A    Caucasian    A
45B    Caucasian    A - follow up
68A    Asian        A
68B    Asian        A - follow up

Both Ethnicity and Study Group are currently factors. I'd like to extract the number of patients in each ethnicity within each study group, but I'm struggling to find a way forward. Any help is welcome.

scoa

2 Answers

1

Using dplyr:

library(dplyr)

# Count the rows for every Study.Group / Ethnicity combination
pairedAB %>%
  group_by(Study.Group, Ethnicity) %>%
  summarise(number = n())
jeremycg
  • How do you know that this is the expected output? – akrun Aug 05 '15 at 14:31
  • Be patient guys. It takes time to understand programming and ask the right questions. Thank you. I will give the above solution a go... – LadyoftheWater Aug 05 '15 at 15:49
  • My data = pairedAB. Using the above and substituting dat for pairedAB returns this: "Study.Group" "Ethnicity" number 1 Study.Group Ethnicity 64. This isn't what I'm after. I need to split the groups into "A" and "A - follow-up" and then within each return the number of women who fall into each ethnic group. Under Ethnicity, these groups are "Black", "Asian" and "Caucasian". Cheers – LadyoftheWater Aug 05 '15 at 15:52
  • Just to qualify: both columns are contained in df called "pairedAB" – LadyoftheWater Aug 05 '15 at 15:57
  • Hi, I have 64 observations so doing that would return a long list that I don't think would be welcome. Ethnicity and Study Group are 2 columns from this data. The patients are listed by row with 64 observations in columns. What I'd like to achieve is pulling out Ethnicity by the 2 study groups to start with, then doing the same for an observation of interest (e.g., HPV infection, which changes dramatically pre- and post-treatment). Does that help? In the original, I just simplified by taking the first 2 columns of interest. Sorry for any confusion. – LadyoftheWater Aug 05 '15 at 16:09
  • Try now. The problem is that you have referred to "Study Group" and Study.Group interchangeably, and we can't tell which your data has unless you post a dput. See [the reproducible example faq](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – jeremycg Aug 05 '15 at 16:15
  • You just contributed to cancer research. Many thanks for your interest. – LadyoftheWater Aug 05 '15 at 16:26
  • Full working code for clarity for anyone else: dput(head(pairedAB)) pairedAB %>% group_by(Study.Group, Ethnicity) %>% summarise(number = n()) – LadyoftheWater Aug 05 '15 at 16:29
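
The "Study Group" versus Study.Group mix-up discussed in the comments can be reproduced with a small sketch. The CSV text below is hypothetical toy data (not the real study data), and the `.groups` argument assumes dplyr 1.0.0 or later. By default, read.csv() rewrites a header such as "Study Group" into the syntactic name Study.Group, which is the name group_by() then needs, unquoted. The single-row result reported above may have come from passing the column names as quoted strings, though the code that produced it is not shown.

```r
library(dplyr)

# Hypothetical toy data mirroring the question's layout (not the real study data).
csv_text <- "ID,Ethnicity,Study Group
45A,Caucasian,A
45B,Caucasian,A - follow up
68A,Asian,A
68B,Asian,A - follow up
70A,Asian,A"

# read.csv() defaults to check.names = TRUE, so the header "Study Group"
# becomes the syntactic column name "Study.Group".
pairedAB <- read.csv(text = csv_text, stringsAsFactors = TRUE)
print(names(pairedAB))
# "ID" "Ethnicity" "Study.Group"

# Unquoted column names give one row per study group / ethnicity pair.
counts <- pairedAB %>%
  group_by(Study.Group, Ethnicity) %>%
  summarise(number = n(), .groups = "drop")
print(counts)
```

Running names(pairedAB), or posting dput(head(pairedAB)), is a quick way to confirm which spelling a data frame actually uses before grouping on it.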
0

Provided that the dplyr answer by @jeremycg produces the correct output (the question doesn't show an expected output), here is a data.table alternative:

library(data.table)
setDT(pairedAB)  # converts a plain data.frame to a data.table in place
pairedAB[, .(number = .N), by = c("Ethnicity", "Study.Group")]  # .N = rows per group
PavoDive
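
For anyone who wants to try the data.table approach without the original file, here is a minimal self-contained sketch. The rows are hypothetical toy data, and the base R table() call at the end is an extra cross-check that does not come from either answer.

```r
library(data.table)

# Hypothetical toy data (not the real study data).
pairedAB <- data.table(
  ID          = c("45A", "45B", "68A", "68B", "70A"),
  Ethnicity   = factor(c("Caucasian", "Caucasian", "Asian", "Asian", "Asian")),
  Study.Group = factor(c("A", "A - follow up", "A", "A - follow up", "A"))
)

# .N is the number of rows in each by-group.
counts <- pairedAB[, .(number = .N), by = .(Study.Group, Ethnicity)]
print(counts)

# Base R cross-tabulation of the same two columns, as a cross-check.
xtab <- table(pairedAB$Study.Group, pairedAB$Ethnicity)
print(xtab)
```

The data.table result is in long form (one row per combination), while table() gives a wide study group by ethnicity grid; both should report the same counts.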