0

I have a data.frame with two columns, a unique identifier and a result. I need to loop through the data.frame and get a count of how many unique Identifiers there are and than a count of the unique results. The results column can have three possible results, Positive, Negative or Ambiguous. So for example if there are 10 “RVP PCR” identifiers I need to create a row with four more columns, "Count", “Positive”, “Negative”, “Ambiguous” and in those columns should be a count of how many times they occurred. So in the example with 10 ”RVP PCR” Identifiers the output row should show the Identifier than count 10, 7 Negatives, 1 Positive and 2 Ambiguous. How would you accomplish this with R ?

str(foo)
>
'data.frame':   51 obs. of  2 variables:
 $ identifier: Factor w/ 99 levels "ADENOPCR","ALB-BF",..: 51 51 56 56 57 57 57 57 18 18 ...
 $ result    : Factor w/ 3 levels "Ambiguous","Negative",..: 2 1 2 1 2 1 2 1 2 1 ...



dput(foo)
>
    structure(list(identifier = structure(c(80L, 80L, 80L, 80L, 80L, 
80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 
80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 64L, 
18L, 18L, 76L, 76L, 76L, 70L, 70L, 70L, 70L, 71L, 64L, 77L, 77L, 
77L, 77L, 77L, 77L, 77L, 77L, 76L), .Label = c("ADENOPCR", "ALB-BF", 
"ASPERAG", "ASPERAGB", "BDGLUCAN", "BLASTO", "BORD PCR", "BPERT", 
"CMV QNT", "CMVPCR", "COCCI", "COCCI G/M", "COCCI PAN", "COCCI-PPT", 
"CPNEUMOPCR", "CRP", "CRY BLD", "CWP-KOH", "DIFF CONF", "EBV PAN", 
"EBV PAN 2", "EBV QNT", "EXCEPT", "EXCEPT TT", "FLUFAC", "FUNG PKG", 
"FUNGSEQ", "GLU-FL", "HERP I", "HHV6PCR", "HISTO", "HISTO PPT", 
"HISTOAG S", "HISTOGM U", "HMPVFA", "HMPVPCR", "HSVPCR", "LEGAG-U", 
"LEGIONFA", "LEGIONPCR", "MA AFB", "MA FUNGAL", "MA MIC", "MA MTBPRIM", 
"MC AFB", "MC AFBID", "MC AFBR", "MC BAL", "MC BLD", "MC CYST", 
"MC FUNG", "MC FUNGID", "MC Legion", "MC LEGION", "MC MTD", "MC NOC", 
"MC RESP", "MC STAPH", "MC Strep", "MC STREP", "MC VRE", "MC W", 
"MICROSEQ", "MPNEUMOPCR", "MS CWP", "MTBRIF PCR", "MYCO-M", "NG REPORT", 
"ORGSEQ", "PARAFLUPCR", "PCP PCR", "PNEUMO AB", "PNEUMST", "PNEUMST R", 
"RESPMINI", "RESPMINI ", "RSPFA", "RSPFAC", "RSV", "RVP PCR", 
"RVPPCR", "SPN AG", "TP-FL", "V CMVC", "V FLUC", "V HSVC", "V HSVCT", 
"V RESPC", "V Urea", "V VIC", "V VIC R", "V VIRAL", "V VIRAL N", 
"V VIRAL R", "V VZV", "VDRL CSF", "VZVFAC", "VZVPCR", "WNILE PCR"
), class = "factor"), result = structure(c(2L, 2L, 3L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 
2L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Ambiguous", 
"Negative", "Positive"), class = "factor")), .Names = c("identifier", 
"result"), row.names = 1500:1550, class = "data.frame")
Nodedeveloper101
  • 421
  • 1
  • 9
  • 24

5 Answers5

2

I'm not totally sure what your expected output is, but you can reshape your data:

library(reshape2)

dcast(foo, identifier~result, fun.aggregate= length)

This produces:

  identifier Negative Positive
1    CWP-KOH        2        0
2 MPNEUMOPCR        0        2
3 PARAFLUPCR        3        1
4    PCP PCR        0        1
5  RESPMINI         4        0
6      RSPFA        7        1
7    RVP PCR       28        2

######## EDIT TO ADD #############

With the data your provided, there's no way "RVP PCR" will yield the results you stated.

PavoDive
  • 6,322
  • 2
  • 29
  • 55
2
library(dplyr)
library(tidyr)
foo %>%
  group_by(identifier, result) %>%
  summarise(n = n()) %>%
  spread(key = result, value = n, drop = FALSE, fill = 0) %>%
  mutate(Total = Ambiguous + Negative + Positive) %>%
  filter(Total > 0)

The result

Source: local data frame [7 x 5]
Groups: identifier [7]

  identifier Ambiguous Negative Positive Total
      (fctr)     (dbl)    (dbl)    (dbl) (dbl)
1    CWP-KOH         0        2        0     2
2 MPNEUMOPCR         0        0        2     2
3 PARAFLUPCR         0        3        1     4
4    PCP PCR         0        0        1     1
5  RESPMINI          0        4        0     4
6      RSPFA         0        7        1     8
7    RVP PCR         0       28        2    30
Thierry
  • 18,049
  • 5
  • 48
  • 66
1

Without additional packages you can do:

xtabs(~ identifier + result, data=droplevels(foo))

This gives this result:

> xtabs(~ identifier + result, data=droplevels(foo))
            result
identifier   Negative Positive
  CWP-KOH           2        0
  MPNEUMOPCR        0        2
  PARAFLUPCR        3        1
  PCP PCR           0        1
  RESPMINI          4        0
  RSPFA             7        1
  RVP PCR          28        2

If you want a dataframe:

as.data.frame(unclass(xtabs(~ identifier + result, data=droplevels(foo))))

If you want the result in a long format, you can also do:

foo$count <- 1
aggregate(count ~ identifier+result, data=foo, FUN=length)
jogo
  • 12,469
  • 11
  • 37
  • 42
  • Great base solution, yet there could be some use of the result being a data frame? – PavoDive Feb 25 '16 at 19:49
  • @PavoDive I edited my answer to include a dataframe-variant. If you like my base solution you can vote it up. – jogo Feb 27 '16 at 09:38
1

The data is in long format. Change it to wide first using dcast command from reshape2 library. Add a column and take sum of all the rows.

library(reshape2)    
widedata<-dcast(foo,identifier~result)
widedata$Count<-0 #adds column for Count
widedata$Count<-rowSums (widedata[,2:4], na.rm = FALSE, dims = 1) #[,2:4] since the data will have a column for ambiguous as well.
0
library(tidyr)
library(dplyr)

foo %>%
  count(identifier, result) %>%
  spread(result, n) # or spread(result, n, fill = 0, drop = FALSE)

#   identifier Negative Positive
#       (fctr)    (int)    (int)
# 1    CWP-KOH        2       NA
# 2 MPNEUMOPCR       NA        2
# 3 PARAFLUPCR        3        1
# 4    PCP PCR       NA        1
# 5  RESPMINI         4       NA
# 6      RSPFA        7        1
# 7    RVP PCR       28        2
JasonAizkalns
  • 20,243
  • 8
  • 57
  • 116