2

I have a large database of more than 85,000 values for more than 100 different companies tagged along more than 100 variables. My goal is to identify the descriptive statistics (mean, standard deviation, min max, and the number of values) corresponding to several variables.

The following is a set of information on one given company which I'll call Company F.

Attendance   Number of representatives   Number of Presenters     Company Audience  
29           2                            30                      2
20           3                            30                      4   
30           10                           20                      5
40           20                           10                      5
10           30                           13                      5

What I am looking to do is to have R calculate the descriptive statistics [mean, standard deviation, min, and max] and for each one of these specific columns and export it into Excel in the following manner:

Company F  Average Number of Attendance Standard Deviation of Number of Attendance Min  Number of Attendance Max Number of Attendance and Number of People in Attendance Average of Number of Representatives   Standard Deviation of Number of Representatives Min of Number of Representatives Max Number of Representatives Total Number of Values Average Number of Presenters Standard Deviation Number of Presenters Min Number of Presenters Max Number of Presenters Total Number of Presenters Average Company Audience Standard Deviation Company Audience Min Number of Company Audience Max Number of Company Audience Total Number of Company Audience 

Because this is such a long row, I'll summarize it by saying that I am trying to find the descriptive statistics [mean, standard deviation, min, max, and n] of each one of these columns. These should all correspond to company F.

How I have tried to solve this problem:

I have used the descriptive statistics feature in R to get the dataframe to identify the code for me. To do this I used the psych package:

 library(psych)
 describe(CompanyF$Attendance)
 describe(CompanyF$NumberofRepresentatives)
 describe(CompanyF$Number_of_Presenters
 describe(CompanyF$Company Audience)

From using the package I was able to get dataframes and then go into Excel and construct the rows by hand, inputting the values that I received and omitting any information given by the psych library package that didn't correspond to what I was interested in. The following is an example of the type of information I got from the psych package:

vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
1    1 559 2.02 2.21      1    1.75 1.48   0   9     9 0.78    -0.65 0.09

This process is very time consuming and is open for errors. Upon finishing the work for Company F, I then create a new row in Excel just underneath the one for Company F but this time for another company such as Company G whereupon I continue the process of finding the descriptive statistics [mean, standard deviation, min, max, and n] for each one of these variables of interest (attendance, number of representatives, number of presenters, and company audience).

I have looked for various solutions one of which came from this stack overflow post Export data from R into Excel but I was not able to find an explanation of how to import from R row by row information into Excel as well as how to specify that it identify descriptive statistics that I have listed above.

Ideally, I would have the following output placed into Excel:

Company F  Average Number of Attendance Standard Deviation of Number of Attendance Min  Number of Attendance Max Number of Attendance and Number of People in Attendance Average of Number of Representatives   Standard Deviation of Number of Representatives Min of Number of Representatives Max Number of Representatives Total Number of Values Average Number of Presenters Standard Deviation Number of Presenters Min Number of Presenters Max Number of Presenters Total Number of Presenters Average Company Audience Standard Deviation Company Audience Min Number of Company Audience Max Number of Company Audience Total Number of Company Audience 
Company G  Average Number of Attendance Standard Deviation of Number of Attendance Min  Number of Attendance Max Number of Attendance and Number of People in Attendance Average of Number of Representatives   Standard Deviation of Number of Representatives Min of Number of Representatives Max Number of Representatives Total Number of Values Average Number of Presenters Standard Deviation Number of Presenters Min Number of Presenters Max Number of Presenters Total Number of Presenters Average Company Audience Standard Deviation Company Audience Min Number of Company Audience Max Number of Company Audience Total Number of Company Audience 
Company H  Average Number of Attendance Standard Deviation of Number of Attendance Min  Number of Attendance Max Number of Attendance and Number of People in Attendance Average of Number of Representatives   Standard Deviation of Number of Representatives Min of Number of Representatives Max Number of Representatives Total Number of Values Average Number of Presenters Standard Deviation Number of Presenters Min Number of Presenters Max Number of Presenters Total Number of Presenters Average Company Audience Standard Deviation Company Audience Min Number of Company Audience Max Number of Company Audience Total Number of Company Audience 

and so on.

A raw subset of my data is below:

structure(list(sn = structure(c(2L, 2L, 3L, 5L, 2L, 7L, 1L, 9L, 
1L, 9L, NA, 9L, 1L, 26L, 11L, 9L, 7L, NA, NA, 7L, 17L, 9L, NA, 
21L, 7L, 17L, 7L, 7L, 16L, 7L, 7L, 7L, 7L, 26L, 7L, 6L, 26L, 
22L, NA, NA, 11L, 23L, 23L, 26L, NA, 7L, 23L, 1L, NA, 1L, 7L, 
11L, 12L, 13L, 9L, NA, 15L, NA, 20L, 15L, NA, 17L, 5L, NA, 22L, 
15L, NA, NA, 5L, 8L, 32L, 29L, 23L, 33L, 1L, 23L, 14L, 6L, 7L, 
15L), .Label = c("Broome Street", "Company A", "Company B", "Company BC",
"Company C", "Company CC", "Company D Clinton", "Company DD", 
"Company E", "Company ED BroadCompany", "Company G", "Company H     
BroadCompany", 
"Company I BroadCompany", "Company I Studio", "Company J", "Company K", 
"Company L", "Company M", "Company M BroadCompany", "Company M HS    
 BroadCompany", 
"Company MCC BroadCompany", "Company N", "Company P", "Company Q", 
"Company Q Company N", "Company Q Company ZZ", "Company R - Company ZZ", 
"Company SLab", "Company Z", "Company ZE", "Company ZED", "Company ZEQ", 
"Company ZZ", "Company ZZQ", "Company ZZQ Company N"), class = "factor"), 
earn_tot = c(21.85, 20.8, NA, 8.16, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, 7.16, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, 43.32, NA, 30.48, NA, NA, 34.9, NA, NA, NA, NA, NA, 25.82, 
40.75, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, 
30, NA, NA, NA, NA, NA, NA, 39.1, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, 52.29, 44.32, NA, 7, 38.32, 0, NA, NA, 8.25, 
NA, NA), earn_and_current_tot = c(29.43, 20.8, NA, 8.16, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 7.16, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, 49.9, NA, 37.56, NA, NA, 41.98, 
NA, NA, NA, NA, NA, 37.32, 49, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, 0, NA, NA, NA, 37, NA, NA, NA, NA, NA, NA, 47.68, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 57.29, 48.48, NA, 
7, 45.9, 0, NA, NA, 15.75, NA, NA), pass_99 = c(0L, 0L, NA, 
NA, NA, NA, 1L, NA, NA, NA, NA, 5L, NA, 0L, NA, 5L, NA, NA, 
NA, 0L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, 0L, NA, NA, NA, NA, 5L, NA, NA, NA, NA, 4L, 0L, 
NA, NA, NA, 4L, 4L, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, 
NA, 1L, NA, NA, NA, NA, 1L, NA, NA, 0L, 4L, 0L, NA, NA, 0L, 
NA, NA), pass_65 = c(0L, 0L, 5L, 0L, 6L, NA, 0L, 5L, NA, 
5L, NA, 6L, NA, 0L, 5L, 2L, NA, NA, NA, 0L, 5L, 5L, NA, NA, 
NA, 0L, NA, 1L, 4L, 7L, 5L, 5L, 7L, 0L, 5L, NA, 0L, 1L, NA, 
NA, NA, 2L, 0L, 6L, NA, 8L, 2L, 0L, NA, 4L, 0L, 1L, 3L, NA, 
NA, NA, NA, NA, 4L, 0L, NA, 5L, 7L, NA, 0L, NA, NA, NA, 5L, 
0L, 5L, 4L, 0L, 2L, 0L, 0L, 7L, 0L, NA, 5L)), .Names = c("sn", 
"earn_tot", "earn_and_current_tot", "pass_99", "pass_65"), row.names = c(NA, 
80L), class = "data.frame")

There are four subset columns that are of most importance. These columns are "earn_tot", "earn_and_current_tot", "pass_99", and "pass_65". There are numerous companies listed here which have been anonymized. There are roughly 100 companies I am working with. There are numerous company names under the column entitled "sn". The name of the entire subsetted dataset is called Subset.MergedEx.So.

I apologize for not putting up a good reproducible example. Thank you for your patience. I've been reading how to construct one and have used the following code: dput((head(Subset.MergedEx.SO, 80)))

Community
  • 1
  • 1
Dleightful
  • 107
  • 2
  • 13
  • Maybe [this can help](http://stackoverflow.com/questions/35339532/comparing-groups-in-data-table-columns) – Jaap Feb 16 '16 at 14:59
  • Thank you for the pointer. I am not sure exactly how to implement the code for my own specific purposes. In particular, I may have missed information on how to export it row by row into Excel. – Dleightful Feb 16 '16 at 15:14
  • I've added an answer. HTH. For future posts please include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This makes it easier for others to help you. – Jaap Feb 16 '16 at 15:18

2 Answers2

1

What you could do is melt your data into long format and then transform it back to wide format with multiple aggregation functions:

library(data.table)
dat.new <- dcast(melt(dat, id="company"),
                 company ~ variable, 
                 fun = list(mean,sd), 
                 value.var = "value")

this gives:

> dat.new
   company value_mean_attendance value_mean_presenters value_mean_audience value_sd_attendance value_sd_presenters value_sd_audience
1:       A                   8.0                  24.8                60.6            1.870829            4.207137          7.668116
2:       B                   8.2                  23.8                64.2            2.489980            2.387467          2.049390

Now you can write that to an excel-file with for example the WriteXLS package:

library(WriteXLS)
WriteXLS("dat.new","companies.xls")

Because you want to calcutate many statistics for each company, you might want to consider writing the summary statistics for each ompany to a separate sheet in your excel-file.

Again, you transform your data into long format with melt, then you summarise it with lapply(.SD, function(x) list(average = mean(x), sdev = sd(x)))$value for each company and each variable. Split the resulting data.table by company in a list of data.tables. Finally write that list to an excel-file:

dat.new <- melt(dat, id="company")[, lapply(.SD, function(x) list(average = mean(x), sdev = sd(x)))$value, 
                                    .(company,variable)]

company.list <- split(dat.new, dat.new$company)

WriteXLS(company.list,"companies.xls")

Now you have an excel-file with a separate tab for each company.


Used data:

set.seed(21)
dat <- data.table(company = rep(c("A","B"), each = 5),
                  attendance = sample(5:10,10,TRUE),
                  presenters = sample(20:30,10,TRUE),
                  audience = sample(50:70,10,TRUE))
Jaap
  • 81,064
  • 34
  • 182
  • 193
1

This is might not be an optimal solution, but it only used the base and the psych package.

Here is the data

df <- data.frame(company = rep(c("A","B", "C","D"), each = 5),
              attendance = sample(5:10,20,TRUE),
              representatives = sample(2:30,20,TRUE),
              presenters = sample(20:30,20,TRUE),
              audience = sample(50:70,20,TRUE))

I wrote a function to get values you needed. I assume you only have 5 categories of information: company name, attendance, representatives, presenters, audience.

    get.values<-function(x){
    require(psych)
    info<-describeBy(x[,2:5], group = x[,1])
    n.companies<-length(levels(df[,1]))
    n<-list()
    mean<-list()
    sd<-list()
    min<-list()
    max<-list()
    for(i in 1:n.companies){
      n[[i]]<-info[[i]][,2]
      mean[[i]]<-info[[i]][,3]
      sd[[i]]<-info[[i]][,4]
      min[[i]]<-info[[i]][,8]
      max[[i]]<-info[[i]][,9]
    }
  l<-Map(c, mean, sd, min, max, n)
  valuedf<-do.call(rbind, l)
return(valuedf)
}

I also wrote a function to generate the column names you want, you can name them to whatever you want:

get.names<-function(x){
      require(psych)
      names<-rownames(describe(x[,2:5]))
      avg<-character()
      sd<-character()
      min<-character()
      max<-character()
      total<-character()
  for(i in 1:length(names)){
      avg[i]<-paste("average number of", names[i])
      sd[i]<-paste("standard deviation of", names[i])
      min[i]<-paste("min number of", names[i])
      max[i]<-paste("max number of", names[i])
      total[i]<-paste("total number of", names[i])
  }
  cnames<-c(avg,sd,min,max,total)
return(cnames)
}

Combine the values and names into a new dataframe:

output<-get.values(df)
col.names<-get.names(df)
colnames(output)<-col.names
rownames(output)<-levels(df[,1]) 

Export to excel:

library(xlsx)
write.xlsx(output, "descriptives.xlsx")
fhlgood
  • 479
  • 4
  • 9
  • This is perfect. I have the output for the first company in the excel file marked "descriptives". How do I get the second row to show up right underneath it in the same spreadsheet marked "descriptive.xlsx". My fear is that if I run it agian, it will delete the first row and put in a new set of values again. – Dleightful Feb 16 '16 at 17:52
  • I edited the codes and created another function to get the values you need. This should be applied to a bigger general dataframe which stores information for multiple companies. – fhlgood Feb 16 '16 at 19:08
  • I am having trouble implementing the code although it's almost certainly a problem on my end: df <- mergedex(mergedex$sn(c("AEL","ADL", "BKL","CDL", "QL"...)), mergedex$earn_and_current_tot mergededex$earn_tot, mergededex$total_attend mergedex$total_attend_rep)) mergedex is the 39,000 observation file with 251 variables. One of the variables is "sn" which tells the company name (all 94 of them I think). I only chose to write down five of them above "AEL", "ADL", "BKL", "CDL" and "QL" with an ellipsis to show others – Dleightful Feb 16 '16 at 19:46
  • can you provide a small subset of your raw data? It is kind hard to navigate. – fhlgood Feb 16 '16 at 19:51
  • Edit complete. The updated raw data is above in the original question. – Dleightful Feb 16 '16 at 21:14