I hope you can help me with this problem. For my work I have to use R to analyze survey data. The data has several columns by which I want to group the rows and then do some calculations, e.g. how many men and women work in a given department? For each group I then want the count and the percentage. For example: 42 people work in department A, of whom 30 are women and 12 are men; 70 people work in department B, of whom 26 are women and 44 are men.
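To make the arithmetic in this example concrete, here is a minimal base R sketch that rebuilds the four counts quoted above and derives both kinds of percentage from them. The object names (`counts`, `n_department`, `pct_department`, `pct_gender`) are made up for illustration only.

```r
# Counts quoted in the example (rows: department, columns: gender).
counts <- matrix(c(30, 12,
                   26, 44),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(department = c("A", "B"),
                                 gender = c("f", "m")))

# People per department, and each department's share of all respondents.
n_department   <- rowSums(counts)                    # A = 42, B = 70
pct_department <- 100 * n_department / sum(counts)   # A = 37.5, B = 62.5

# Share of each gender within its own department (each row sums to 100).
pct_gender <- 100 * prop.table(counts, margin = 1)
print(round(pct_gender, 1))
```

The department percentage is taken relative to all respondents, while the gender percentage is taken relative to the department it belongs to.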

I currently use the following code to output the data (using ddply):

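## group by column_name, then summarise each subset by column_name2
percentage_median_per_group_multiple_columns <- function(data, column_name, column_name2){
    library(plyr)
    descriptive <- ddply( data, column_name,
        function(x){ 
            # the result of this first call is discarded: a function only
            # returns the value of its last expression
            percentage_median_per_group(x, column_name)
            percentage_median_per_group(x, column_name2)
        }
    )
    print(data.frame(descriptive))
}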

## give number, percentage and median per group_value in column
percentage_median_per_group <- function(data, column_name3){
    library(plyr)
    descriptive <- ddply( data, column_name3,
        function(x){ 
             # `<-` inside c() assigns but does not name the elements,
             # so the result columns come out as V1 and V2
             c(
                 N <- nrow(x),               #number of rows in this group
                 pct <- (N/nrow(data))*100   #percentage of the rows passed in as data
                                             #TODO: median
             )
        }
    )
    return(descriptive)
}
#calculate
percentage_median_per_group_multiple_columns(users_surveys_full_responses, "department", "gender")

The output currently looks like this:

Department     Sex  N    % per sex
   A           f    30     71,4
               m    12     28,6

   B           f    26     37,1
               m    44     62,9

However, I want the output to look like this, with the calculations done and printed at each grouping level:

Department   N    % per department     Sex  N    % per sex
   A        42     37,5                f    30     71,4
                                       m    12     28,6
   B        70     62,5                f    26     37,1
                                       m    44     62,9
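One possible way to get both levels into a single table is to count each level separately and then merge the results. The following is only a sketch in base R, not a tested replacement for the ddply code above; the function name `two_level_summary` and the column names `N_outer`, `pct_outer`, `N_inner` and `pct_inner` are hypothetical, and `example` is a small data frame built from the counts in the question.

```r
# Hypothetical helper: counts and percentages at two nesting levels.
two_level_summary <- function(data, outer, inner) {
  # Level 1: rows per outer group and their share of all rows.
  outer_n <- as.data.frame(table(data[[outer]]),
                           responseName = "N_outer", stringsAsFactors = FALSE)
  names(outer_n)[1] <- outer
  outer_n$pct_outer <- 100 * outer_n$N_outer / nrow(data)

  # Level 2: rows per outer/inner combination (empty combinations dropped).
  inner_n <- as.data.frame(table(data[[outer]], data[[inner]]),
                           responseName = "N_inner", stringsAsFactors = FALSE)
  names(inner_n)[1:2] <- c(outer, inner)
  inner_n <- inner_n[inner_n$N_inner > 0, ]

  # Attach the outer totals, then express inner counts relative to them.
  out <- merge(outer_n, inner_n, by = outer)
  out$pct_inner <- 100 * out$N_inner / out$N_outer
  out <- out[order(out[[outer]], out[[inner]]), ]
  rownames(out) <- NULL
  out
}

# Data frame reproducing the counts from the question.
example <- data.frame(
  department = rep(c("A", "B"), times = c(42, 70)),
  gender     = c(rep(c("f", "m"), times = c(30, 12)),
                 rep(c("f", "m"), times = c(26, 44))),
  stringsAsFactors = FALSE
)
res <- two_level_summary(example, "department", "gender")
print(res)
```

Note that `table()` converts the grouping values to character, so numeric department codes would be sorted as text, and the department totals are repeated on every row rather than shown once as in the desired layout.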

Does anyone have a suggestion for how I can do that? Ideally it would be dynamic, so that I can group by any number of columns (e.g. department + sex + type of software + ...), but I would already be happy with the output shown in the example =)
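For the dynamic case, one idea is to loop over the grouping columns, count the combinations of the first k columns at each step, and divide by the count of the parent group. The sketch below assumes base R only; `nested_summary`, the `N_<column>` / `pct_<column>` naming scheme and the `software` column in the toy data are all hypothetical.

```r
# Hypothetical helper: one count/percentage pair per nesting level.
nested_summary <- function(data, group_cols) {
  result <- NULL
  for (k in seq_along(group_cols)) {
    cols <- group_cols[seq_len(k)]
    # Count rows for every observed combination of the first k columns.
    level <- aggregate(rep(1L, nrow(data)), by = data[cols], FUN = sum)
    n_name   <- paste0("N_", group_cols[k])
    pct_name <- paste0("pct_", group_cols[k])
    names(level)[k + 1] <- n_name

    if (k == 1) {
      # Top level: percentage of all rows.
      level[[pct_name]] <- 100 * level[[n_name]] / nrow(data)
      result <- level
    } else {
      # Deeper levels: percentage of the parent group's count.
      parent_n <- paste0("N_", group_cols[k - 1])
      result <- merge(result, level, by = cols[-k])
      result[[pct_name]] <- 100 * result[[n_name]] / result[[parent_n]]
    }
  }
  # Sort by the grouping columns, outermost first.
  result <- result[do.call(order, unname(as.list(result[group_cols]))), ]
  rownames(result) <- NULL
  result
}

# Toy data: the counts from the question plus a made-up third column.
toy <- data.frame(
  department = rep(c("A", "B"), times = c(42, 70)),
  gender     = c(rep(c("f", "m"), times = c(30, 12)),
                 rep(c("f", "m"), times = c(26, 44))),
  software   = rep(c("x", "y"), length.out = 112),
  stringsAsFactors = FALSE
)
res2 <- nested_summary(toy, c("department", "gender"))
res3 <- nested_summary(toy, c("department", "gender", "software"))
print(res2)
```

Each level's percentage is relative to the level directly above it, so the percentages within one parent group add up to 100. `aggregate()` drops rows with missing values in the grouping columns, which may or may not be what is wanted for survey data.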

Thanks!

EDIT: You can use this to generate example data:

n <- 100

sample_data <- data.frame(department = sample(1:20, n, replace = TRUE),
                          gender = sample(1:2, n, replace = TRUE))
percentage_median_per_group_multiple_columns(sample_data, "department", "gender")

In the output, V1 stands for N (the count) and V2 for the percentage.

MarchingHome
PSR
  • Those function names are the longest I've seen so far :) – talat Nov 13 '14 at 16:20
  • @beginneR Hehe, I know, just to make it clear/readable for now ;-) – PSR Nov 13 '14 at 16:23
  • @PSR Good goal, but e.g. “calculate number” is not interesting information: it’s purely redundant, and the suffix “value” even more so. These are filler words. Shortening the name to `percentage_median_per_group` preserves all the information and improves readability further. – Konrad Rudolph Nov 13 '14 at 16:30
  • @KonradRudolph Thanks you are right, edited the names for readability. – PSR Nov 13 '14 at 16:48
  • Can you provide sample data? ([reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)) – r2evans Nov 13 '14 at 16:49
  • @r2evans Thanks, yes see the edited post for generating example data and calling the function (with the right column names) – PSR Nov 13 '14 at 17:01
