When using the plyr package to summarise my data, it seems impossible to use the nlevels() function.

The structure of my data set is as follows:

> aer <- read.xlsx("XXXX.xlsx", sheetIndex=1)
> aer$ID <- as.factor(aer$ID)
> aer$description <- as.factor(aer$description)
> head(aer)

  ID SOC      start        end days count severity relation
1  1 410 2015-04-21 2015-04-28    7     1        1        3
2  1 500 2015-01-30 2015-05-04   94     1        1        3
3  1 600 2014-11-25 2014-11-29    4     1        1        3
4  1 600 2015-01-02 2015-01-07    5     1        1        3
5  1 600 2015-01-26 2015-03-02   35     1        1        3
6  1 600 2015-04-14 2015-04-17    3     1        1        3

> dput(head(aer,4))
structure(list(ID = structure(c(1L, 1L, 1L, 1L), .Label = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "12", "13", "14", 
"15"), class = "factor"), SOC = c(410, 500, 600, 600),  
start = structure(c(16546, 16465, 16399, 16437), class = "Date"), 
end = structure(c(16553, 16559, 16403, 16442), class = "Date"), 
days = c(7, 94, 4, 5), count = c(1, 1, 1, 1), severity = c(1, 
1, 1, 1), relation = c(3, 3, 3, 3)), .Names = c("ID", "SOC", 
"description", "start", "end", "days", "count", "severity", "relation"
), row.names = c(NA, 4L), class = "data.frame")

What I would like to know is how many levels of the "ID" variable occur in each data section created when the data set is divided by the variable "SOC". I want to summarise this information together with some other variables in a new data set. Therefore, I would like to use the plyr package like so:

summaer2 <- ddply(aer, c("SOC"), summarise,
    participants    = nlevels(ID), 
    events          = sum(count),
    min_duration    = min(days), 
    max_duration    = max(days),
    max_severity    = max(severity))

This returns the following error:

Error in Summary.factor(c(4L, 5L, 11L, 11L, 14L, 14L), na.rm = FALSE) : 
‘max’ not meaningful for factors

Could someone give me advice on how to reach my goal? Or what I'm doing wrong?

Many thanks in advance!

RmyjuloR
  • Are you sure `nlevels()` is the problem? It seems to be complaining about `max()`. Are you sure `days` and `severity` are numeric? You should share your input data in a [reproducible format](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) (i.e. a `dput()`) so we can see how you imported your data. – MrFlick Jun 22 '15 at 14:58
  • @MrFlick I've substituted nlevels() with length(), then it works fine. (But I don't get my levels, just the length of the data sections.... :) ) – RmyjuloR Jun 22 '15 at 15:01
  • @Veerendra Gadekar `max(levels(ID))` gives me the same number for every data section, which is not correct. It does not give me an error, though. – RmyjuloR Jun 22 '15 at 15:04
  • I think I found the solution: substituting `nlevels(ID)` with `length(unique(ID))` gives me the number of levels per section... – RmyjuloR Jun 22 '15 at 15:10
  • @Veerendra Gadekar `max(as.vector(severity))` also gives me incorrect values – RmyjuloR Jun 22 '15 at 15:20
  • @RmyjuloR please describe what your desired output should look like – Veerendra Gadekar Jun 22 '15 at 15:22
  • @Veerendra Gadekar See the updated post above :) – RmyjuloR Jun 22 '15 at 15:34
  • @RmyjuloR so maybe you can answer the question yourself and mark it as answered – Veerendra Gadekar Jun 22 '15 at 15:40
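
Following up on the first comment: the error message says that `max()` received a factor, but the excerpt does not show which column that was. A minimal sketch of how one might check, using a toy data frame with hypothetical values (not the asker's data) in which `severity` is deliberately stored as a factor:

```r
# Toy stand-in for `aer` (hypothetical values); `severity` is deliberately
# a factor here, which is an assumption made only for illustration.
toy <- data.frame(
  SOC      = c(410, 410, 500),
  days     = c(7, 94, 4),
  severity = factor(c("1", "2", "1"))
)

# List the class of every column; a "factor" column passed to max() or
# min() triggers "'max' not meaningful for factors".
col_classes <- sapply(toy, class)
col_classes

# Convert a factor that holds numbers via its labels, not its integer codes.
toy$severity <- as.numeric(as.character(toy$severity))
max(toy$severity)
```

Going through `as.character()` first matters because `as.numeric()` on a factor returns the internal level codes rather than the values shown when printing.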

1 Answer

Update:

Substituting nlevels(ID) with length(unique(ID)) seems to give me the desired output:

> head(summaer2)
   SOC participants events min_duration max_duration max_severity
1  100            4      7            1           62            2
2  410            9     16            1           41            2
3  431            2      2          109          132            1
4  500            5      9           23          125            2
5  600            8     19            1           35            1
6 1040            1      1           98           98            2
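
A likely reason `nlevels(ID)` does not give a per-group count is that subsetting a factor keeps all of its original levels, so every section reports the total for the whole column. The sketch below illustrates this in base R on a toy data frame with hypothetical values (not the data set above):

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  ID    = factor(c("1", "1", "2", "3", "3", "3")),
  SOC   = c(410, 410, 410, 500, 500, 600),
  count = c(1, 1, 1, 1, 1, 1)
)

# A subset of a factor retains every level of the full column.
sub <- toy$ID[toy$SOC == 500]   # only ID "3" is present here
nlevels(sub)                    # counts all levels of the original factor
length(unique(sub))             # counts the distinct values actually present
nlevels(droplevels(sub))        # same count, after dropping unused levels

# Distinct IDs per SOC group, computed with base R's tapply()
participants <- tapply(toy$ID, toy$SOC, function(x) length(unique(x)))
participants
```

Either `length(unique(ID))` or `nlevels(droplevels(ID))` should therefore work as the `participants` expression in the `ddply()` call.
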
RmyjuloR