Extracting Ethnicity by Study Groups in R when columns are factors

Question

I have a CSV file which contains cancer data for two study groups: A and A - follow up (e.g., before and after treatment). The data are presented as follows:

ID           Ethnicity        Study Group    
45A          Caucasian        A  
45B          Caucasian        A - follow up  
68A          Asian            A    
68B          Asian            A - follow up

Both Ethnicity and Study Group are currently factors. I'd like to extract the total for each ethnicity within each study group, but I'm struggling to find a way forward. Any help welcome.

And how is it a problem that they are factors? – Heroka Aug 05 '15 at 14:25
What is the desired output? – jeremycg Aug 05 '15 at 15:06
@jeremycg - see below please – LadyoftheWater Aug 05 '15 at 15:56

jeremycg · Accepted Answer · 2015-08-05T16:14:38.693

1

Using dplyr:

library(dplyr)
# Count the rows in each Study.Group / Ethnicity combination
pairedAB %>%
  group_by(Study.Group, Ethnicity) %>%
  summarise(number = n())
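
To try this without the original CSV, here is a minimal sketch that rebuilds the four sample rows from the question as a data frame with factor columns. The name `pairedAB` comes from the comments below. The column name `Study.Group` is an assumption, based on what `read.csv()` produces by default from a header of `Study Group`. The sketch uses base R's `table()` rather than dplyr so that it runs without extra packages, and it shows that factor columns need no conversion before counting.

```r
# Sample rows from the question; the column names are an assumption.
pairedAB <- data.frame(
  ID          = c("45A", "45B", "68A", "68B"),
  Ethnicity   = factor(c("Caucasian", "Caucasian", "Asian", "Asian")),
  Study.Group = factor(c("A", "A - follow up", "A", "A - follow up"))
)

# Cross-tabulate the two factors, then reshape to one row per combination.
counts <- as.data.frame(
  table(Study.Group = pairedAB$Study.Group, Ethnicity = pairedAB$Ethnicity),
  responseName = "number"
)
print(counts)
```

With only these four rows, every combination has a count of 1. Unlike the dplyr version, `table()` also lists combinations that never occur, with a count of 0.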

edited Aug 05 '15 at 16:14

answered Aug 05 '15 at 14:23



How do you know that this is the expected output? – akrun Aug 05 '15 at 14:31
Be patient guys. It takes time to understand programming and ask the right questions. Thank you. I will give the above solution a go... – LadyoftheWater Aug 05 '15 at 15:49
My data = pairedAB. Using the above and substituting dat for pairedAB returns this: "Study.Group" "Ethnicity" number 1 Study.Group Ethnicity 64. This isn't what I'm after. I need to split the groups into "A" and "A - follow-up" and then within each return the number of women who fall into each ethnic group. Under Ethnicity, these groups are "Black", "Asian" and "Caucasian". Cheers – LadyoftheWater Aug 05 '15 at 15:52
Just to qualify: both columns are contained in df called "pairedAB" – LadyoftheWater Aug 05 '15 at 15:57
Hi, I have 64 observations so doing that would return a long list that I don't think would be welcome. Ethnicity and Study Group are 2 columns from this data. The patients are listed by row with 64 observations in columns. What I'd like to achieve is pulling out Ethnicity by the 2 study groups to start with, then doing the same for an observation of interest (e.g., HPV infection, which changes dramatically pre- and post-treatment). Does that help? In the original, I just simplified by taking the first 2 columns of interest. Sorry for any confusion. – LadyoftheWater Aug 05 '15 at 16:09
Try now. The problem is that you have referred to "Study Group" and Study.Group interchangeably, and we can't tell which your data has unless you use a dput. See [the reproducible example faq](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – jeremycg Aug 05 '15 at 16:15
You just contributed to cancer research. Many thanks for your interest. – LadyoftheWater Aug 05 '15 at 16:26
Full working code for clarity for anyone else: dput(head(pairedAB)) pairedAB %>% group_by(Study.Group, Ethnicity) %>% summarise(number = n()) – LadyoftheWater Aug 05 '15 at 16:29
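
The column-name confusion in this thread can be reproduced in a few lines. This is a sketch, assuming the CSV header really reads `Study Group` with a space; the inline CSV text is made up from the sample rows in the question. By default `read.csv()` uses `check.names = TRUE`, which turns the space into a dot, so the column has to be referred to as `Study.Group` in code.

```r
# Inline CSV built from the question's sample rows (a header with a space is assumed).
csv_text <- "ID,Ethnicity,Study Group
45A,Caucasian,A
45B,Caucasian,A - follow up
68A,Asian,A
68B,Asian,A - follow up"

# Default behaviour: names are made syntactically valid, so the space becomes a dot.
default_names <- names(read.csv(text = csv_text))

# With check.names = FALSE the header is kept exactly as written.
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(raw_names)

# dput() prints a copy-pasteable definition that shows the names and column types.
dput(head(read.csv(text = csv_text, stringsAsFactors = TRUE), 2))
```

Pasting the `dput()` output into a question lets answerers see the exact column names and whether each column is a factor.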

PavoDive · Answer 2 · 2015-08-06T15:52:40.043

0

Provided that the dplyr answer by @jeremycg produced the correct output (as the question doesn't have an expected output), here is the data.table alternative:

library(data.table)
setDT(pairedAB)  # convert the data.frame to a data.table by reference
pairedAB[, .(number = .N), by = .(Ethnicity, Study.Group)]  # .N is the row count per group

edited Aug 06 '15 at 15:52

answered Aug 05 '15 at 17:32


Thank you. I'll try both. Cheers. – LadyoftheWater Aug 06 '15 at 12:52
You might want to tick the check mark below the voting of the answer you found more useful, that encourages people to answer your further questions ;) – PavoDive Aug 06 '15 at 15:51
Hi, unfortunately the above code didn't work for me. But I would be interested in any further tweaks. Thanks – LadyoftheWater Aug 07 '15 at 16:54
