Specs: R 3.2.4, Windows 7 Enterprise SP1 (32-bit)

I'm trying to do a boxplot on a subset of a data frame, grouped by a particular level. However, I'm obviously doing something wrong, because it's plotting for all the levels in the original frame, not the subset.

We have an online banking platform that does "real time" communications with about 500 client institutions, and we're seeing some slow response times for some clients. I'm trying to use R to visualize the data in different ways to look for a pattern.

My data frame is a 1-hour snapshot of message response times across all institutions during a particularly busy morning. This snapshot is generated from a database query and saved to a .csv file on the file system:

rt=read.csv("\\path\\to\\csv\\file",header=TRUE)

The structure of the data frame is message sequence #, network id, institution id, date, message class, and elapsed time for the message. Network id refers to the specific communications interface (we have about 28-30 active interfaces).

I've created a subset of that snapshot by picking institutions that belong to a particular network:

rt.network=subset(rt,rt$Network==41)

At this point, rt.network should only contain observations for 4 institutions:

levels(factor(rt.network$Institution))
[1] "INST1" "INST2" "INST3" "INST4"

So far so good. Now I want to see a box plot of the elapsed times for each of those institutions, so I do the following:

boxplot(Elapsed~Institution,data=rt.network,outline=FALSE)

I expect to see results for only those institutions in the subset frame; however, R is plotting results for all ~500 institutions, where all but 4 are empty and those 4 are uselessly skinny (I don't have an easy way to share the image, sorry; just imagine a box plot where the X axis has 500 entries and 4 one-to-two-pixel-wide boxes).

The Question - why is R generating plots for institutions not contained within the subset data frame? What have I done wrong in the boxplot or subset commands?

Needless to say, I'm confused; I don't understand why those empty levels are showing up in the plot at all.

If necessary, I can filter the results I want from the database and reload; I just thought it would be nice to load all the data once, and do the filtering/subsetting within R.
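
For reference, here is a minimal sketch of the symptom using made-up data. The six institution names, the two network ids, and the random elapsed times are invented stand-ins, not the real snapshot. The `Institution` column is built with `factor()` explicitly so the behaviour does not depend on how `read.csv` converts strings, and `plot = FALSE` is used only to inspect the groups that `boxplot` would draw.

```r
# Synthetic stand-in for the real snapshot; every value here is invented.
set.seed(1)
rt <- data.frame(
  Network     = rep(c(41, 42), each = 6),
  Institution = factor(rep(paste0("INST", 1:6), each = 2)),
  Elapsed     = runif(12)
)

# Same subsetting step as in the question.
rt.network <- subset(rt, Network == 41)

# The subset only has rows for INST1..INST3, but the factor still
# carries all six levels from the original frame.
levels(rt.network$Institution)

# Wrapping in factor() builds a new factor from the values present,
# so this view hides the unused levels.
levels(factor(rt.network$Institution))

# boxplot() groups by the factor's levels, so it creates one slot per
# level, including the empty ones. plot = FALSE returns the summary
# without drawing anything.
bp <- boxplot(Elapsed ~ Institution, data = rt.network,
              outline = FALSE, plot = FALSE)
bp$names  # one entry per level, not per institution present
bp$n      # number of observations in each slot
```

With the real data the same thing would presumably happen with ~500 slots instead of six.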

John Bode
  • Maybe see `?droplevels`? – zx8754 Mar 16 '16 at 19:53
  • See [here](http://stackoverflow.com/questions/1195826/drop-factor-levels-in-a-subsetted-data-frame). [This](http://www.stat.berkeley.edu/~s133/factors.html) could be an interesting read too. And most importantly, please read [this](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) if you are planning to post additional questions on [tag:r] in the future. – David Arenburg Mar 16 '16 at 19:57
  • @Frank: Typo, sorry, will fix. – John Bode Mar 16 '16 at 19:58
  • Ok, I think you should be looking at `levels(rt.network$Institution)`, not `levels(factor(rt.network$Institution))`. Probably, by wrapping it in `factor()` you are implicitly dropping levels / releveling. – Frank Mar 16 '16 at 20:01
  • @Frank Not probably, he is. From `?factor`: "The default is the unique set of values taken by as.character(x)". – joran Mar 16 '16 at 20:04
  • @joran: Note that I'm using `factor` just to see what institutions are in the data set; I'm not using that as part of the plotting command, or in the creation of the subset. – John Bode Mar 16 '16 at 20:05
  • @zx8754: That looks promising, but I'm not sure where that needs to go in the workflow; any suggestions of where I should wedge that in? – John Bode Mar 16 '16 at 20:06
  • Yes, I know that. What _we_ are saying is that by using `factor` to look at what is in the data set you are _altering the data set_. – joran Mar 16 '16 at 20:06
  • @DavidArenburg: Thanks, that was it. Y'all can go ahead and close this one off as a duplicate, then. – John Bode Mar 16 '16 at 20:08
  • Just before boxplot, `rt.network$Institution <- droplevels(rt.network$Institution)` – zx8754 Mar 16 '16 at 20:48
  • An alternative workaround (which I use) after subsetting data frames is to re-factor the variable I subsetted on, to avoid this problem: `rt.network=subset(rt,rt$Network==41)`, then `rt.network$Network <- factor(rt.network$Network)` – OFish Mar 16 '16 at 23:41
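
The two suggestions in the comments above (`droplevels` and re-applying `factor`) can be sketched as follows. This is an illustration on invented data, not the original snapshot: the institution names, network ids, and elapsed times are made up, and `rt.dropped` and `rt.refactored` are hypothetical names. Note that the variable that needs its levels reset is the one used for grouping in the plot (`Institution` here), which is not necessarily the one the subset was taken on.

```r
# Synthetic stand-in for the real snapshot; every value here is invented.
set.seed(1)
rt <- data.frame(
  Network     = rep(c(41, 42), each = 6),
  Institution = factor(rep(paste0("INST", 1:6), each = 2)),
  Elapsed     = runif(12)
)
rt.network <- subset(rt, Network == 41)

# Option 1: droplevels() on the whole data frame removes unused levels
# from every factor column and leaves non-factor columns alone.
rt.dropped <- droplevels(rt.network)

# Option 2: rebuild just the grouping factor from the values present.
rt.refactored <- rt.network
rt.refactored$Institution <- factor(rt.refactored$Institution)

# Either way, only the institutions in the subset remain as levels.
levels(rt.dropped$Institution)
levels(rt.refactored$Institution)

# plot = FALSE returns the grouping summary without drawing; drop it
# to get the actual plot.
bp <- boxplot(Elapsed ~ Institution, data = rt.dropped,
              outline = FALSE, plot = FALSE)
bp$names  # one slot per remaining level
```

Calling `droplevels` on the data frame right after `subset` is probably the least error-prone of the two, since it covers every factor column at once.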

0 Answers