
I'm trying to illustrate, by ID, the effect on sample size of successively applying various (decreasingly restrictive) sample restrictions, using a bar plot that looks something like this:

[Image: ideal_output]

The blue bar is what remains after all 5 restrictions are applied; the gold bar shows the impact of the least restrictive condition; the spring green bar shows the impact of the second-least restrictive condition; and so forth.

Here's some sample data:

library(data.table)
set.seed(8195)
dt<-data.table(id=rep(1:5,each=2e3),flag1=!!runif(1e4)>.76,
               flag2=!!runif(1e4)>.88,flag3=!!runif(1e4)>.90,
               flag4=!!runif(1e4)>.95,flag5=!!runif(1e4)>.99)

The code I'm using so far leaves something to be desired: (1) it's rather verbose, and (2) it doesn't strike me as very robust or generalizable. Does anyone have experience producing something like this and can offer improvements on either front? I have a feeling this type of graph should be pretty common in data analysis, so I'm somewhat surprised there's no dedicated function for it.

Here's what I'm doing so far:

dt[order(-id)][,
                #to find out how many observations are lost by
                #  applying flag 1 (we keep un-flagged obs.), 
                #  look at the count of indices before and
                #  after applying flag 1
               {l1<-!flag1;i1<-.I[l1];n1<-length(.I)-length(i1);
                #to find the impact of flag 2, we apply flag 2
                #  _in addition to_ flag 1--the observations
                #  we keep have _neither_ flag 1 _nor_ flag 2;
                #  the impact is measured by the number of 
                #  observations lost by applying this flag
                #  (that weren't already lost from flag 1)
               l2<-l1&!flag2;i2<-.I[l2];n2<-length(i1)-length(i2);
               l3<-l2&!flag3;i3<-.I[l3];n3<-length(i2)-length(i3);
               l4<-l3&!flag4;i4<-.I[l4];n4<-length(i3)-length(i4);
               l5<-l4&!flag5;i5<-.I[l5];n5<-length(i4)-length(i5);
               #finally, the observations we keep have _none_
               #  of flags 1-5 applied
               n6<-length(i5);
               c(n6,n5,n4,n3,n2,n1)},by=id
               ][,{barplot(matrix(V1,ncol=uniqueN(id)),
                           horiz=T,col=c("blue","gold","springgreen",
                                         "orange","orchid","red"),
                           names.arg=paste("ID: ",uniqueN(id):1),
                           las=1,main=paste0("Impact of Sample Restrictions",
                                             "\nBy ID"),
                           xlab="Count",plot=T)}]

Not pretty. Thanks for your input.

MichaelChirico
  • the `!!` are redundant, you already have logical with the equality tests. But, they look cool :) – Rorschach Aug 05 '15 at 18:20
  • All the negations on top of negations in your lower block of code are quite confusing. Maybe you could explain in words what that's all about? – Frank Aug 05 '15 at 18:24
  • @Frank I added some comments to the code. Perhaps it wasn't clear that the observations we keep have `flag_k=F` at each stage. – MichaelChirico Aug 05 '15 at 18:29
  • Okay. I think the obvious thing to do is create a categorical variable according to the rules you're applying. Then you can tabulate it. (I may be misunderstanding the input demanded by `barplot`.) – Frank Aug 05 '15 at 18:30
  • My best guess for that categorical var is `dt[, false_flag := TRUE][,g := max.col(.SD), by=id]` based on your comments in the code, but it doesn't match up with your numbers exactly. – Frank Aug 05 '15 at 18:41
  • can you do a `cumsum` by row? I can't even figure out how to do this in `data.table` but it seems like it might work. – Rorschach Aug 05 '15 at 18:44
  • `cumsum` shouldn't work because the logicals are recursive, so order matters, e.g. if `flag2==F` but `flag1==T` for an observation – MichaelChirico Aug 05 '15 at 18:45
  • @MichaelChirico As far as I can tell, only the first TRUE matters (if there is any), so you should be able to use `max.col` here... – Frank Aug 05 '15 at 18:54
  • yes, but you would have the index of the `cumsum` as well – Rorschach Aug 05 '15 at 19:01
  • I think @Frank is on the right path: `dt[,categ:=max.col(.SD[,paste0("flag",5:1),with=F],ties.method="first")]; dt[,barplot(table(categ,id),horiz=T)]` is _just_ off. Only problem is it mis-assigns category 1 if all are false. – MichaelChirico Aug 05 '15 at 19:09
  • @Frank got it: instead of `.SD[...]`, use `cbind(.5,.SD[...])` so that if all are false, `max.col` is 1. – MichaelChirico Aug 05 '15 at 19:10
  • Well, for the sake of our edification, maybe you could post a solution :) I would surely like to make such a nice graph some day. I think `rev(.SD)` may also work – Frank Aug 05 '15 at 19:12

1 Answer


As @Frank pointed out, this is much simpler if all these successive flags are converted to a categorical variable taking, say, 1 for the blue bars, 2 for the gold bars, 3 for the spring green bars, and so on.

As @Frank also pointed out, max.col offers us a convenient way of creating a variable that takes exactly those values, and quickly:

dt[,categ:=max.col(cbind(.5,.SD),ties.method="last"),
   .SDcols=paste0("flag",5:1)]

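(What's happening here? max.col takes care of the recursive nature of the flags for us by picking the rightmost (because ties.method="last") TRUE value in each row. Since the columns are passed in the order flag5, ..., flag1, the rightmost TRUE is the lowest-numbered flag, i.e. the first restriction that removes the observation. If all flags are FALSE, the first column is largest because it is always .5, which is greater than 0. Check out this table:)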

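 0 1 2 3 4 5 # Column positions: .5, flag5, flag4, flag3, flag2, flag1
.5 F F F F F # No flags apply, so column 0 wins
.5 T F T F F # Columns 1 & 3 (flag5 and flag3) are TRUE; column 3 is
             #   the rightmost TRUE, so flag3 is the binding condition.
             #   Once flag 3 has removed the observation, it no longer
             #   matters which of the subsequent flags may or may not apply.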
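
As a quick check of that logic, here is a minimal base-R sketch that feeds the two rows of the table to max.col directly. The names m and categ_demo are made up for illustration. Note that max.col counts columns from 1, so the table's column 0 comes back as 1 and its column 3 comes back as 4:

```r
# The two rows from the table above, with TRUE/FALSE written as 1/0;
# the leading 0.5 stands in for the constant column added by cbind(.5, .SD).
m <- rbind(c(0.5, 0, 0, 0, 0, 0),
           c(0.5, 1, 0, 1, 0, 0))

# ties.method = "last" picks the rightmost of the tied maxima in each row.
categ_demo <- max.col(m, ties.method = "last")
print(categ_demo)  # 1 4
```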

With categ thus defined, graphing becomes a cinch. This alone will work:

dt[,barplot(table(categ,id))]

To get all the bells and whistles:

dt[,barplot(table(categ,id)[,5:1],horiz=T,
            col=c("blue","gold","springgreen",
                  "orange","orchid","red"),
            names.arg=paste("ID: ",uniqueN(id):1),
            las=1,main=paste0("Impact of Sample Restrictions",
                              "\nBy ID"),
            xlab="Count",plot=T)]
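
The question also asked about generalizability. As a rough sketch (not part of the original answer), the same max.col trick could be wrapped in a base-R helper that accepts any number of flag columns. The name restriction_table, its arguments, and the demo data below are all hypothetical, and the helper assumes the flag columns are logical and supplied in the order the restrictions are applied:

```r
# Hypothetical helper: classify each row by the first restriction
# (in application order) that removes it, then tabulate by id.
restriction_table <- function(flags, id) {
  flags <- as.matrix(flags)  # logical columns, first-applied flag first
  k <- ncol(flags)
  # Reverse the columns and prepend a constant 0.5, so that
  # max.col(..., "last") returns 1 when no flag is TRUE and otherwise
  # k + 2 - (index of the first TRUE flag).
  categ <- max.col(cbind(0.5, flags[, k:1, drop = FALSE]),
                   ties.method = "last")
  # Fixing the factor levels keeps empty categories as rows of zeros.
  table(categ = factor(categ, levels = seq_len(k + 1)), id = id)
}

# Small made-up data set with three flags instead of five.
set.seed(1)
n <- 1000
id <- rep(1:5, each = n / 5)
flags <- data.frame(flag1 = runif(n) > .76,
                    flag2 = runif(n) > .88,
                    flag3 = runif(n) > .90)
tab <- restriction_table(flags, id)
print(tab)
```

Passing the result to barplot(tab[, ncol(tab):1], horiz = TRUE, las = 1) should give the same kind of stacked layout as above, with row 1 of the table being the observations that survive every restriction.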


MichaelChirico
  • Nice. As I mentioned above, `.SD[,paste0("flag",5:1),with=F]` is `rev(.SD)` as far as I can tell. If not, `.SDcols` would be more idiomatic. – Frank Aug 05 '15 at 19:33
  • @Frank have to worry about `id` column. You're right about `.SDcols` since `id` is unused; revised. – MichaelChirico Aug 05 '15 at 19:35