How to compare the distributions of two vectors in R?

Question

Here is a screenshot of my dataset:

[Image: screenshot of the dataset]

Here's what it's about: imagine that you work in a delivery company and, for some reason, a package fails to be delivered to the client. The distribution of the number of packages returned changes according to the monetary value of the package, which is the first variable of the dataset (Levels). Column B represents the distribution of all packages sold by the company last month, grouped by the value of the package. The last column, C, represents the distribution of packages that failed to be delivered because of some criterion (say, a dangerous neighborhood).

What I want to show visually is that this specific criterion is so important that it changes the distribution of the data. I used Excel to calculate those percentages from the raw data because I'm not allowed to install R at work.

I made the following plot after some data wrangling, but I guess I could do better if I knew how:

[Image: plot comparing the two distributions]

Edit: I was told to post a dput version of the dataset:

structure(list(Levels = structure(c(6L, 11L, 12L, 13L, 1L, 2L, 
3L, 4L, 5L, 7L, 8L, 9L, 10L), .Label = c("Less than $1000", "Less than $1200", 
"Less than $1400", "Less than $1600", "Less than $1800", "Less than $200", 
"Less than $2000", "Less than $2200", "Less than $2400", "Less than $2600", 
"Less than $400", "Less than $600", "Less than $800"), class = "factor"), 
    X.ofTotal = c(0.3802, 0.2475, 0.1218, 0.0664, 0.0409, 0.0247, 
    0.0178, 0.016, 0.0099, 0.0109, 0.0061, 0.0063, 0.0063), X..ofTotalWithSomeCriteria = c(0.6087, 
    0.1957, 0.0652, 0.0435, 0, 0.0217, 0, 0, 0.0435, 0.0217, 
    0, 0, 0)), .Names = c("Levels", "X.ofTotal", "X..ofTotalWithSomeCriteria"
), class = "data.frame", row.names = c(NA, -13L))

You're about to be closed because you're asking a vague question and have not stated it very well. You will get more of a response if you try something first, present a [minimal working example](http://stackoverflow.com/help/mcve) with some data and code, and ask what went wrong. Please see [this](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) as well. (What are the `Variable`s? Is this a time-series? Categories? As it stands, **the answer to your question is "It Depends (tm)"** on way too many things you have not shared with us.) — r2evans, Jun 12 '15 at 04:55
I'm sorry @r2evans, I guess I'm having trouble explaining what I want because I'm not fluent in English. I'll try to add more details and open a new thread. — iatowks, Jun 12 '15 at 04:59
Don't open a new thread if it's the same question, just edit this question. English fluency is nice for many of us but you are certainly not the first. Provide a simple example, give us *some* data (read the links I mentioned above, perhaps use `dput`), and go from there. Please don't give us *all* the data if you can make a sample problem smaller. — r2evans, Jun 12 '15 at 05:01
Maybe `?pairs` might be a helpful generic function if you are trying to visually do pairwise comparisons of variables. — thelatemail, Jun 12 '15 at 05:22
@iatowks, with just two variables, it might be insightful to use `plot(dat[,2:3])`, but since I believe you want to show more than two variables (from your pre-edited question), as @thelatemail suggested, try `pairs(dat[,-1])`. (Better posing of the question, by the way.) — r2evans, Jun 12 '15 at 05:53
Probably should have gone on the Cross Validated Stack Exchange. — Mike Wise, Jun 12 '15 at 09:16

HOSS_JFL · Accepted Answer · 2015-06-12T08:47:48.157

I would plot the empirical cumulative distribution functions (ECDFs) of the two columns. This makes sense because comparing these two functions is also the basis of the Kolmogorov–Smirnov test, which assesses whether the difference between two distributions is significant.
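To make the link to the Kolmogorov–Smirnov test concrete, here is a minimal sketch that computes the largest vertical gap between the two ECDFs, which is the two-sample KS statistic. The values are copied from the question's dput output; the names `total`, `crit`, `F_total`, `F_crit`, `grid` and `D` are illustrative and not part of the original answer.

```r
# Values copied from the question's dput output.
total <- c(0.3802, 0.2475, 0.1218, 0.0664, 0.0409, 0.0247, 0.0178,
           0.016, 0.0099, 0.0109, 0.0061, 0.0063, 0.0063)
crit  <- c(0.6087, 0.1957, 0.0652, 0.0435, 0, 0.0217, 0, 0,
           0.0435, 0.0217, 0, 0, 0)

# ecdf() returns a step function that can be evaluated at any point.
F_total <- ecdf(total)
F_crit  <- ecdf(crit)

# Both ECDFs only jump at observed values, so the largest gap
# between them is attained at one of those values.
grid <- sort(unique(c(total, crit)))
D <- max(abs(F_total(grid) - F_crit(grid)))
D

# The built-in test reports the same statistic
# (ties in the data may trigger a warning about the p-value).
suppressWarnings(ks.test(total, crit))$statistic
```

With only 13 values per column and several ties, the p-value from `ks.test` should be read with caution; the sketch is mainly meant to show what the plotted curves are being compared on.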

There are at least two options to plot these functions in R:

# Option 1: base R. `data` is the data frame from the question.
plot(ecdf(data$X.ofTotal), col = "green", xlim = c(0, 1), verticals = TRUE, main = "")
# add = TRUE draws the second ECDF on the same axes
plot(ecdf(data$X..ofTotalWithSomeCriteria), col = "red", verticals = TRUE, add = TRUE)

# Option 2: Ecdf() from the Hmisc package, with one curve per group
library(Hmisc)
l <- length(data$X..ofTotalWithSomeCriteria)
dataset <- c(rep("Total", l), rep("Criteria", l))
Ecdf(c(data$X.ofTotal, data$X..ofTotalWithSomeCriteria), group = dataset, col = c("blue", "red"))
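Because the two columns are already shares per price band rather than raw observations, another possible reading is to accumulate the shares across the bands and compare the two cumulative curves directly. This is a different technique from the `ecdf()` calls above, offered only as a rough sketch. It assumes the rows are in ascending price order (as in the posted data) and that the band upper bounds run from $200 to $2600 in steps of $200, as the labels suggest. The names `upper`, `cum_total`, `cum_crit` and `gap` are illustrative.

```r
# Shares per price band, copied from the question's dput output
# (rows are in ascending order of package value).
upper <- seq(200, 2600, by = 200)  # assumed upper bound of each band, in dollars
total <- c(0.3802, 0.2475, 0.1218, 0.0664, 0.0409, 0.0247, 0.0178,
           0.016, 0.0099, 0.0109, 0.0061, 0.0063, 0.0063)
crit  <- c(0.6087, 0.1957, 0.0652, 0.0435, 0, 0.0217, 0, 0,
           0.0435, 0.0217, 0, 0, 0)

# Cumulative share of packages up to each band.
cum_total <- cumsum(total)
cum_crit  <- cumsum(crit)

# Step plot of both cumulative curves on the same axes.
plot(upper, cum_total, type = "s", col = "blue", ylim = c(0, 1),
     xlab = "Package value (upper bound, $)", ylab = "Cumulative share")
lines(upper, cum_crit, type = "s", col = "red")
legend("bottomright", legend = c("Total", "Criteria"),
       col = c("blue", "red"), lty = 1)

# Largest vertical gap between the two cumulative curves.
gap <- max(abs(cum_total - cum_crit))
gap
```

The first column sums to about 0.95 in the posted data, so its curve stops short of 1, presumably because some bands are not shown.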

So is the question answered? — HOSS_JFL, Jun 13 '15 at 08:35
