
I am a total R novice, and I struggle with some of the terminology, among other things. But my advisor wants me to streamline what is now a very tedious routine in his research.

Our data are divided into 2 cities, one with 5 people and one with 4, and each speaker with 110 to 112 data points per variable, with 21 variables total. (So in Excel there's one column for city, one for person, and one for each of the 21 variables.) We are just trying to describe the data in terms of mean, median, stdev, excess kurtosis, and skewness. We are also using shapiro.test.

Right now, we define an object for each person and run each function on one column in that object, but it takes too long. How can I get a test to run over each variable's column for just one speaker? I've read about the apply family and about for loops but I can't seem to get them to work for me--I'm probably lacking the terminology.

If it helps, the cities are labeled Erie and Rice, and the speakers are just Erie1, Erie2, Rice1, etc.

Thank you!

P.S. If possible, I'd really appreciate knowing how to get results that are copy-and-paste-able into Excel, because we're still copying and pasting individual function results.

  • Please study this [FAQ](http://stackoverflow.com/a/5963610/1412059) and follow the advice there to improve your question. – Roland Sep 29 '16 at 14:52
  • Hi Mellsworth, could you provide a sample of your data? – ArunK Sep 29 '16 at 14:53
  • Please ask only one question per post. If you have multiple ones you can simply ask multiple questions. In regard to your excel copy-pasting, there are several options: 1) stop using excel (:)), 2) write your results to csv, 3) use packages that can directly interact with excel such as `openxlsx`. – Paul Hiemstra Sep 29 '16 at 15:18

2 Answers

1

The function you are looking for is probably tapply or aggregate. Something like:

DF=data.frame(Cities,Speakers,Var1,Var2,...,Varn)

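This is your data.frame; Var1 to Varn are the 21 variables you want to summarise.

I will work through shapiro.test, since it is probably the most complex case. It returns a whole test object rather than a single number, so the p-value is extracted here to get one value per group:

res = aggregate(DF$Var1, by = list(City = DF$Cities), FUN = function(x) shapiro.test(x)$p.value)

You can also run the Shapiro test for each combination of city and speaker:

res = aggregate(DF$Var1, by = list(City = DF$Cities, Speaker = DF$Speakers), FUN = function(x) shapiro.test(x)$p.value)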
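
To cover every variable column in one call instead of one at a time, the same idea can be written with the formula interface of aggregate. This is only a sketch under assumptions: the data frame is simulated, the column names (Cities, Speakers, Var1 to Var3) stand in for the real ones, and shapiro_p is a hypothetical helper that keeps only the p-value, since shapiro.test returns a whole test object rather than a single number.

```r
# Simulated stand-in for the real data: 2 cities, 9 speakers,
# 110 rows per speaker, and only 3 of the 21 variables
set.seed(1)
speakers <- c(paste0("Erie", 1:5), paste0("Rice", 1:4))
DF <- data.frame(
  Cities   = rep(substr(speakers, 1, 4), each = 110),
  Speakers = rep(speakers, each = 110),
  Var1 = rnorm(990), Var2 = runif(990), Var3 = rexp(990)
)

# Hypothetical helper: keep only the Shapiro-Wilk p-value
shapiro_p <- function(x) shapiro.test(x)$p.value

# "." on the left-hand side means "every column not used for grouping"
res <- aggregate(. ~ Cities + Speakers, data = DF, FUN = shapiro_p)
print(res)
```

The result has one row per speaker and one p-value column per variable. Note that the formula method of aggregate drops rows containing missing values by default, which matters if some speakers have gaps in some variables.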

Try it and tell us!

0

The following code creates a dataset I believe looks like yours:

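library(dplyr)
# 2 cities x 5 speakers x 110 observations
nobs = 2 * 5 * 110
# Randomly assign each row a city, then a speaker number within that city
dat = data.frame(city = sample(c('Erie', 'Rice'), nobs, replace = TRUE)) %>%
  mutate(speaker = paste0(city, sample(1:5, nobs, replace = TRUE))) %>%
  arrange(city, speaker)
# 21 columns of uniform random numbers standing in for the measured variables
data_matrix = matrix(runif(21 * nobs), nobs, 21)
colnames(data_matrix) = sprintf('Var%d', 1:21)
dat = as.data.frame(cbind(dat, data_matrix))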

Next we can use dplyr to split the data by each unique city/speaker combination. We use the convenience function summarise_each to apply several functions at once (note that summarise_each and funs are deprecated in current dplyr releases):

dat %>% group_by(city, speaker) %>% summarise_each(funs(mean, sd, median), Var1:Var21)
Source: local data frame [10 x 65]
Groups: city [?]

     city speaker Var1_mean Var2_mean Var3_mean Var4_mean Var5_mean Var6_mean
   (fctr)   (chr)     (dbl)     (dbl)     (dbl)     (dbl)     (dbl)     (dbl)
1    Erie   Erie1 0.5028917 0.5069724 0.4720252 0.5462675 0.5021429 0.5134384
2    Erie   Erie2 0.5378896 0.5151194 0.5429039 0.5159513 0.4622817 0.5328961
3    Erie   Erie3 0.4767338 0.4752459 0.5210605 0.4467936 0.4967070 0.4934170
4    Erie   Erie4 0.4752356 0.5497244 0.5010823 0.4944027 0.5000894 0.4926613
5    Erie   Erie5 0.5187913 0.5090330 0.4960665 0.5002147 0.4679352 0.5181322
6    Rice   Rice1 0.5237725 0.4987702 0.4989190 0.5655607 0.5295775 0.5155883
7    Rice   Rice2 0.5043830 0.4851659 0.5363700 0.5089221 0.5155034 0.5116563
8    Rice   Rice3 0.4701997 0.4877534 0.5037869 0.5250760 0.4662257 0.5158385
9    Rice   Rice4 0.4920601 0.5390394 0.5033235 0.5214137 0.4796411 0.5298566
10   Rice   Rice5 0.4922858 0.4702580 0.4977153 0.4571975 0.5128249 0.4979027
Variables not shown: Var7_mean (dbl), Var8_mean (dbl), Var9_mean (dbl),
  Var10_mean (dbl), Var11_mean (dbl), Var12_mean (dbl), Var13_mean (dbl),
  Var14_mean (dbl), Var15_mean (dbl), Var16_mean (dbl), Var17_mean (dbl),
  Var18_mean (dbl), Var19_mean (dbl), Var20_mean (dbl), Var21_mean (dbl),
  Var1_sd (dbl), Var2_sd (dbl), Var3_sd (dbl), Var4_sd (dbl), Var5_sd (dbl),
  Var6_sd (dbl), Var7_sd (dbl), Var8_sd (dbl), Var9_sd (dbl), Var10_sd (dbl),
  Var11_sd (dbl), Var12_sd (dbl), Var13_sd (dbl), Var14_sd (dbl), Var15_sd
  (dbl), Var16_sd (dbl), Var17_sd (dbl), Var18_sd (dbl), Var19_sd (dbl),
  Var20_sd (dbl), Var21_sd (dbl), Var1_median (dbl), Var2_median (dbl),
  Var3_median (dbl), Var4_median (dbl), Var5_median (dbl), Var6_median (dbl),
  Var7_median (dbl), Var8_median (dbl), Var9_median (dbl), Var10_median (dbl),
  Var11_median (dbl), Var12_median (dbl), Var13_median (dbl), Var14_median
  (dbl), Var15_median (dbl), Var16_median (dbl), Var17_median (dbl),
  Var18_median (dbl), Var19_median (dbl), Var20_median (dbl), Var21_median
  (dbl)

The downside of this approach is that the output data.frame gets number_of_vars times number_of_summary_functions columns (here 21 x 3 = 63, plus the two grouping columns). Alternatively, we can use tidyr to reshape the data from wide format to long format and then use dplyr to get the results:

library(tidyr)
dat %>% gather(variable, value, -city, -speaker) %>% 
  group_by(city, speaker, variable) %>%
  summarise_each(funs(mean, sd, median), value)
Source: local data frame [210 x 6]
Groups: city, speaker [?]

     city speaker variable      mean        sd    median
   (fctr)   (chr)   (fctr)     (dbl)     (dbl)     (dbl)
1    Erie   Erie1     Var1 0.5531500 0.2836093 0.5969408
2    Erie   Erie1     Var2 0.4776046 0.3118265 0.4591285
3    Erie   Erie1     Var3 0.5256391 0.2927646 0.5126190
4    Erie   Erie1     Var4 0.4732230 0.2810146 0.4556239
5    Erie   Erie1     Var5 0.4647291 0.2932984 0.4461107
6    Erie   Erie1     Var6 0.5062291 0.2924258 0.5132119
7    Erie   Erie1     Var7 0.4815738 0.2928289 0.4526164
8    Erie   Erie1     Var8 0.4920858 0.2976184 0.5169642
9    Erie   Erie1     Var9 0.4900656 0.2793954 0.4935924
10   Erie   Erie1    Var10 0.4626460 0.2807313 0.4608666
..    ...     ...      ...       ...       ...       ...

The result has one extra column that identifies the variable, plus one column per summary function, with one row for each city/speaker/variable combination.
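
The question also asks for skewness, excess kurtosis, shapiro.test, and results that can be pasted into Excel. The sketch below extends the long-format approach to cover those. It makes several assumptions: it targets newer dplyr and tidyr releases (1.0 or later), where pivot_longer supersedes gather; the data are simulated; skew and excess_kurtosis are hypothetical helpers using plain moment-based formulas, which can differ slightly from the bias-corrected versions in some packages and in Excel; and the output file name is made up.

```r
library(dplyr)
library(tidyr)

# Simulated stand-in data (only 3 variables to keep the example short)
set.seed(42)
speakers <- c(paste0("Erie", 1:5), paste0("Rice", 1:4))
dat <- data.frame(
  city    = rep(substr(speakers, 1, 4), each = 110),
  speaker = rep(speakers, each = 110),
  Var1 = rnorm(990), Var2 = runif(990), Var3 = rexp(990)
)

# Moment-based skewness and excess kurtosis (hypothetical helper names)
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

res <- dat %>%
  # Wide to long: one row per city/speaker/variable/value
  pivot_longer(cols = -c(city, speaker),
               names_to = "variable", values_to = "value") %>%
  group_by(city, speaker, variable) %>%
  summarise(
    mean        = mean(value),
    median      = median(value),
    sd          = sd(value),
    skewness    = skew(value),
    ex_kurtosis = excess_kurtosis(value),
    shapiro_p   = shapiro.test(value)$p.value,
    .groups     = "drop"
  )

# Write the table to a CSV file, which Excel opens directly
out_file <- file.path(tempdir(), "speaker_summary.csv")
write.csv(res, out_file, row.names = FALSE)
```

The CSV contains one row per speaker and variable and one column per statistic, so nothing needs to be copied by hand. Replacing tempdir() with a project folder keeps the file somewhere easier to find.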

Paul Hiemstra