boxplot single scalar variable "by" multiple true/false variables in r data

Question

I've been limping my way around r data for a few months now. Sorry if any of this seems basic. I've been finding all kinds of close problems and solutions, but somehow I can't seem to adapt them to my situation. Starting to wonder if it's something I should be trying to do at all, but I suppose it can't hurt to ask.

I have a data frame that has a single scalar variable and multiple T/F (yes/no; 1/0; 1/2) variables, like this:

    scal var1 var2 var3
     25   0   1    0
     21   0   1    1
     14   1   1    0
     30   1   0    1

I know I can make a boxplot which separates the scalar variable column into categories using "by" for a single variable, like so:

boxplot(df$scal~df$var1)

I also know that I can make box plots for multiple scalar variables at once. I'd like to combine the two somehow to make a boxplot which plots the dependent variable for the "true" subset and the "false" subset of each variable next to one another. In my world, one solution should look something like "boxplot(df$scal~df$var1, df$scal~df$var2, df$scal~df$var3)", but r data doesn't agree with me. Something about not being able to force a datatype.

I could also write a rough loop to go through each of the variables and generate all the plots separately, but I'd like to compare them side-by-side.

I've also thought of rearranging the dataset so that the "true" and "false" sets are in different columns (using subset(df$var1, df$var1==1) etc.), then making multiple boxplots as described before (though this is quite tedious).

var1t var1f var2t var2f var3t var3f
 14    25    25    30    21    25
 30    21    21          30    14
             14                  
boxplot(df2$var1t, df2$var1f, df2$var2t, df2$var2f, df2$var3t, df2$var3f)

However, the different lengths (numbers of rows) of the columns are giving me fits when creating the new dataset. I know that I can make a dataset in another program (saved as .csv, .xls, etc.) and then import it. The null values would remain intact, but I'd really rather not do this manually. As one might imagine, this becomes quite tedious and prone to errors on larger scales.

Help with either approach would be most welcome.

I think using `ggplot2` would make your life a lot simpler. You can check out this post http://stackoverflow.com/questions/20060949/ggplot2-multiple-sub-groups-of-a-bar-chart which is bar chart, but I think is the same idea you are looking for. — jentjr, Apr 15 '15 at 16:48
I've seen solutions similar to that one, but this is almost the opposite. Those have one categorical variable and multiple scalar variables, where I have one scalar and multiple categorical. I couldn't manage to adapt the methods. — babelguppy, Apr 17 '15 at 07:50

score 1 · Accepted Answer · answered Apr 15 '15 at 17:42

Learning how to manipulate data in R can be hard when you're starting out. I agree with @jentjr that learning ggplot2 would be helpful, and Hadley's book provides great tips for working with data in addition to covering ggplot2.

To start off, I would suggest using the reshape2 package to melt your data:

(I created a dummy set so it would be easier for other people to follow along)

library(reshape2)
nObs = 10
df = data.frame(
    scal = rnorm(nObs), 
    var1 = rbinom(nObs, 1, 0.5),
    var2 = rbinom(nObs, 1, 0.5),
    var3 = rbinom(nObs, 1, 0.5))

Then `melt` the data from wide form into long form.

df2 = melt(df, id.vars = c('scal'), 
    variable.name ='myVars', value.name = "zeroOne")
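
To make the "long form" concrete, here is a minimal sketch that applies the same `melt` call to the four rows from the question. The names `toy` and `long` are made up for this illustration and do not appear in the original post.

```r
library(reshape2)

# The four example rows from the question (wide form)
toy <- data.frame(
    scal = c(25, 21, 14, 30),
    var1 = c(0, 0, 1, 1),
    var2 = c(1, 1, 1, 0),
    var3 = c(0, 1, 0, 1))

# Long form: one row per (observation, indicator) pair
long <- melt(toy, id.vars = "scal",
    variable.name = "myVars", value.name = "zeroOne")

# 12 rows: 4 observations x 3 indicator columns
print(long)
```

Each `scal` value is repeated once per indicator column, so the unequal group sizes that caused trouble in the wide layout are no longer an issue.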

Now you may create your desired boxplot using base R: [figure: boxplot drawn with base R]
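
The base R call that produced the figure is not included in the answer. One plausible form, assuming the melted column names `myVars` and `zeroOne` from above, is a formula with both grouping columns on the right-hand side. The sketch below uses the question's four rows (`toy` and `long` are illustrative names only).

```r
library(reshape2)

# The four example rows from the question, melted to long form
toy <- data.frame(
    scal = c(25, 21, 14, 30),
    var1 = c(0, 0, 1, 1),
    var2 = c(1, 1, 1, 0),
    var3 = c(0, 1, 0, 1))
long <- melt(toy, id.vars = "scal",
    variable.name = "myVars", value.name = "zeroOne")

# One box per combination of zeroOne and myVars, drawn side by side
bp <- boxplot(scal ~ zeroOne * myVars, data = long,
    xlab = "zeroOne.myVars", ylab = "scal")

# Labels of the boxes that were drawn
print(bp$names)
```

Writing the formula as `scal ~ myVars * zeroOne` instead should change the order in which the boxes are arranged.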

However, investing the time to learn ggplot2 would allow you to create figures such as this one: [figure: grouped boxplot drawn with ggplot2]

Using code such as this:

library(ggplot2)
# factor() makes the 0/1 column discrete, so each value gets its own set of boxes
ggplot(data = df2, aes(x = factor(zeroOne), y = scal)) + 
    geom_boxplot(aes(fill = myVars))

Note that ggplot2 can make much fancier plots than this (and do so more easily than base R!), and I would encourage you to browse the ggplot2 webpage to see more examples. You may also wish to experiment with swapping zeroOne and myVars, because it changes the plot groupings.
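
As a rough sketch of the swap mentioned above, the variant below puts `myVars` on the x-axis and uses the 0/1 value for the fill, so the two boxes for each variable sit next to each other. It uses the question's four rows; `toy`, `long`, and `p` are illustrative names, and wrapping `zeroOne` in `factor()` is an assumption made here so that ggplot2 treats it as discrete.

```r
library(reshape2)
library(ggplot2)

# The four example rows from the question, melted to long form
toy <- data.frame(
    scal = c(25, 21, 14, 30),
    var1 = c(0, 0, 1, 1),
    var2 = c(1, 1, 1, 0),
    var3 = c(0, 1, 0, 1))
long <- melt(toy, id.vars = "scal",
    variable.name = "myVars", value.name = "zeroOne")

# Variables along the x-axis, 0/1 subsets as adjacent coloured boxes
p <- ggplot(data = long, aes(x = myVars, y = scal)) +
    geom_boxplot(aes(fill = factor(zeroOne))) +
    labs(fill = "zeroOne")
print(p)
```

Which arrangement reads better depends on whether the comparison of interest is between variables or between the 0 and 1 subsets within each variable.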

Thanks. Took some muddling, but this helped. I'd been seeing a lot of stuff with the melt() function, but it's not what I expected. That link probably helped the most. Just to finish this out, I used `boxplot(scal~myVars*zeroOne, data=df2)` and `ggplot(data = df2, aes(x = interaction(zeroOne, myVars), y = scal)) + geom_boxplot(aes(fill = zeroOne))` to get what I finally needed. — babelguppy, Apr 17 '15 at 07:45

stefan.schroedl · Answer 2 · 2015-04-16T09:43:12.353

Plotluck is a library based on ggplot2 that aims at automating the choice of plot type based on characteristics of 1-3 variables. Here is an example with the resulting plot:

library(plotluck)
nObs = 100
df = data.frame(
    scal = rnorm(nObs), 
    var1 = rbinom(nObs, 1, 0.5),
    var2 = rbinom(nObs, 1, 0.5),
    var3 = rbinom(nObs, 1, 0.5))
plotluck.multi(df, y=scal, opts=plotluck.options(use.geom.violin=F))

This command means: plot column scal (on the y-axis) against each other column in df (on the x-axis; this includes scal itself, which results in a density plot or histogram). We specify use.geom.violin=F to enforce a box plot, since the default is a violin plot, which often conveys the shape of the distribution better. If the number of rows is very low, individual points will be plotted.
