Extracting/Exporting the Data of the Empirical Cumulative Distribution Function in R (ecdf)

Question

I use R to calculate the ecdf of some data, and I want to use the results in other software. R only does the computation; the final diagram for my thesis will be produced elsewhere.

Example Code

# Plotting a built-in sample data set
plot(cars$speed)
# Assigning the data to a new variable name
myData = cars$speed
# Calculating the ecdf
myResult = ecdf(myData)
myResult
# Plotting the ecdf
plot(myResult)

Output

> # Plotting a built-in sample data set
> plot(cars$speed)
> # Assigning the data to a new variable name
> myData = cars$speed
> # Calculating the ecdf
> myResult = ecdf(myData)
> myResult
Empirical CDF 
Call: ecdf(myData)
 x[1:19] =      4,      7,      8,  ...,     24,     25
> # Plotting the ecdf
> plot(myResult)
> plot(cars$speed)


Questions

Question 1

How do I get the raw information needed to plot the ecdf diagram in other software (e.g. Excel, Matlab, LaTeX)? For the histogram function I can just write

res = hist(...)

and I find all the information like

res$breaks
res$counts
res$density
res$mids
res$xname

Question 2

How do I calculate the inverse ecdf? Say I want to know how many cars have a speed below 10 mph (the example data is car speed).

Update

Thanks to the answer of user777 I have more information now. If I use

> myResult(0:25)
 [1] 0.00 0.00 0.00 0.00 0.04 0.04 0.04 0.08 0.10 0.12 0.18 0.22 0.30 0.38
[15] 0.46 0.52 0.56 0.62 0.70 0.76 0.86 0.86 0.88 0.90 0.98 1.00

I get the data for 0 to 25 mph. But I do not know where to draw a data point, and I want to reproduce the R plot exactly.

Here I have a data point every 1 mph.

[Image: plot with a data point at every 1 mph]

Here I do not have a data point every 1 mph. I only have a data point where there is an observed value.

[Image: plot with data points only at observed speeds]

Solution

# Plotting a built-in sample data set
plot(cars$speed)
# Assigning the data to a new variable name
myData = cars$speed
# Calculating the ecdf
myResult = ecdf(myData)
myResult
# Plotting the ecdf
plot(myResult)
# Have a look at the probability for 0 to 25 mph
myResult(0:25)
# Have a look at the probability, but only where there are data points
myResult(unique(myData))
# Saving the result to a file
write.csv(cbind(unique(myData), myResult(unique(myData))), file="D:/myResult.txt")

The file myResult.txt looks like

"","V1","V2"
"1",4,0.04
"2",7,0.08
"3",8,0.1
"4",9,0.12
"5",10,0.18
"6",11,0.22
"7",12,0.3
"8",13,0.38
"9",14,0.46
"10",15,0.52
"11",16,0.56
"12",17,0.62
"13",18,0.7
"14",19,0.76
"15",20,0.86
"16",22,0.88
"17",23,0.9
"18",24,0.98
"19",25,1

Meaning

enter image description here

Note: I use a German version of Excel, so the decimal separator is a comma instead of a dot.
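For a spreadsheet that expects a decimal comma, one possible approach is write.csv2, which writes a comma as the decimal mark and a semicolon as the field separator. This is only a sketch: the column names speed and probability and the variable outFile are illustrative choices, and tempfile() stands in for a real path.

```r
myData <- cars$speed
myResult <- ecdf(myData)

# knots() returns the sorted distinct data values of the ecdf
ecdfTable <- data.frame(speed = knots(myResult),
                        probability = myResult(knots(myResult)))

# outFile is a hypothetical location; replace it with a real path as needed
outFile <- tempfile(fileext = ".csv")

# write.csv2 uses "," as decimal mark and ";" as separator
write.csv2(ecdfTable, file = outFile, row.names = FALSE)

# Inspect the first lines of the written file
readLines(outFile, n = 3)
```

Whether the file opens correctly still depends on the regional settings of the spreadsheet program.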

One useful idiom to keep in mind that helps to figure out what is happening is to call the function `ecdf(x)` with the data as an argument: `ecdf(x)(x)`. You may want to supplement it with some additional x-values (e.g. one to the left of the sample). — Glen_b, Jul 27 '14 at 23:44
Hi. Thanks. I do not get it yet what f(x)(x) means. But I will try it later. — Dr. Manuel Kuehner, Jul 28 '14 at 11:20
You call a function `g` of a variable `x` by `g(x)`. Now `ecdf(x)` - the output of the function `ecdf` - is *itself* a function. So you call the function `ecdf(x)` by passing it an argument: `ecdf(x)(x)`. — Glen_b, Jul 28 '14 at 12:38
There's an example of it [here on SO](http://stackoverflow.com/questions/15792552/numpy-scipy-equivalent-of-r-ecdfxx-function) — Glen_b, Jul 28 '14 at 14:09
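The ecdf(x)(x) idiom from the comments above can be sketched as follows. The variable names p, xs and ps are illustrative, and the extra point one unit to the left of the sample is just one way to follow the suggestion of supplementing the x-values.

```r
x <- cars$speed

# ecdf(x) returns a function; the second pair of parentheses calls it on x
p <- ecdf(x)(x)

# Add one point to the left of the sample so the curve starts at 0
xs <- c(min(x) - 1, sort(unique(x)))
ps <- ecdf(x)(xs)

# First few (x, probability) pairs
head(cbind(xs, ps))
```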

Sycorax · Accepted Answer · 2014-07-27T20:30:40.130

3

The output of ecdf is a function, among other object types. You can verify this with class(myResult), which displays the (S3) classes of the object myResult: "ecdf", "stepfun" and "function".
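A short check of this, using the variable name from the question:

```r
# cars is a built-in R data set
myResult <- ecdf(cars$speed)

# List the classes attached to the ecdf object
class(myResult)

# Because the object is a function, it can be called directly
is.function(myResult)
myResult(10)
```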

If you enter myResult(unique(myData)), R evaluates the ecdf object myResult at each distinct value appearing in myData and prints the result to the console. To save the output instead, enter write.csv(cbind(unique(myData), myResult(unique(myData))), file="C:/Documents/My ecdf.csv"), which writes it to that file path.
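A slightly more defensive variant of that export might look like the sketch below. It uses knots(), which returns the sorted distinct data values of an ecdf object, so the rows come out in increasing order even when the input is unsorted. The column names and the outFile variable are assumptions made for illustration, not part of the original answer.

```r
myData <- cars$speed
myResult <- ecdf(myData)

# Jump locations of the step function, in increasing order
jumps <- knots(myResult)

# One row per jump: the x value and the ecdf value reached there
ecdfTable <- data.frame(speed = jumps, probability = myResult(jumps))

# outFile is a hypothetical path; tempfile() keeps the sketch runnable anywhere
outFile <- tempfile(fileext = ".csv")

# row.names = FALSE drops the extra first column of row numbers
write.csv(ecdfTable, file = outFile, row.names = FALSE)
```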

The ecdf doesn't tell you how many cars are above/below a specific threshold; rather, it gives the probability that a randomly selected car from your data set is at or below the threshold (and, via the complement, above it). If you're interested in the number of cars satisfying some criterion, just count them. myData[myData<=10] returns the data elements, and length(myData[myData<=10]) tells you how many of them there are.

Assuming you mean that you want the sample probability that a randomly selected car from your data is at or below 10 mph, that's the value given by myResult(10). Note that the ecdf uses "less than or equal to", so cars at exactly 10 mph are included.
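The distinction between counting, evaluating the ecdf, and inverting it can be sketched as follows. Using quantile() with type = 1 for the inverse is a suggestion that is not in the original answer; R documents type 1 as the inverse of the empirical distribution function. The probability level p is an arbitrary illustrative value.

```r
myData <- cars$speed
myResult <- ecdf(myData)

# Number of cars at or below 10 mph
sum(myData <= 10)

# Proportion at or below 10 mph (what the ecdf returns)
myResult(10)

# Proportion strictly below 10 mph
mean(myData < 10)

# Inverse ecdf: the smallest observed speed whose ecdf value is at least p
p <- 0.5
quantile(myData, probs = p, type = 1)
```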

edited Jul 27 '14 at 20:30

answered Jul 27 '14 at 18:48

Sycorax


Thanks! And how can I exactly reproduce the plot with other software? I would be happy to have a csv file or something similar. – Dr. Manuel Kuehner Jul 27 '14 at 19:12
Thanks again. Maybe I did not describe my wish exactly enough; I updated the question. – Dr. Manuel Kuehner Jul 27 '14 at 19:16
Thanks - that's almost what I need ;). I edited the question. In the R plot there is not a data point every 1 mph, because the data do not contain a value at every 1 mph. – Dr. Manuel Kuehner Jul 27 '14 at 19:25
Great - thanks a lot! I will include the solution in the question so that others have just to look in one place. – Dr. Manuel Kuehner Jul 27 '14 at 19:31
What do I have to do to get an upvote for a question around here? :) – Dr. Manuel Kuehner Jul 27 '14 at 19:43
I see. What would then be the appropriate stackexchange for that kind of questions? – Dr. Manuel Kuehner Jul 27 '14 at 20:01
Haha - great. Before I posted the question I googled 'stackoverflow R statistic' and I landed here. But that's my fault. Thanks again. – Dr. Manuel Kuehner Jul 27 '14 at 20:09

score 3 · Answer 2 · answered Jul 27 '14 at 19:48


As I see it, your main requirement is to reproduce the jumps at each x value. Try this:

> x <- c(cars$speed, cars$speed, 1, 28)
> y <- c((0:49)/50, (1:50)/50, 0, 1)
> ord <- order(x)
> plot(y[ord] ~ x[ord], type="l")

Resulting plot

The first 50 (x,y) pairs are the beginnings of the jumps, the next 50 are the ends, and the last two give you starting and ending values at $(x_1-3,0)$ and $(x_{50}+3,1)$. Then you need to sort the values in increasing order in $x$.
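The same construction can be written for data of any length and saved for use in other software. This is a sketch under some assumptions: the data are sorted first (the code above relies on cars$speed already being in increasing order), the margin pad = 3 simply mirrors the values 1 and 28 used above, and stepTable and outFile are illustrative names.

```r
x <- sort(cars$speed)   # sort so the construction also works for unsorted data
n <- length(x)
pad <- 3                # hypothetical margin to the left and right of the sample

# Beginnings of the jumps, ends of the jumps, then the two outer points
stepX <- c(x, x, min(x) - pad, max(x) + pad)
stepY <- c((0:(n - 1)) / n, (1:n) / n, 0, 1)

# Order by x, breaking ties by y, so the path never moves downwards
ord <- order(stepX, stepY)
stepTable <- data.frame(x = stepX[ord], y = stepY[ord])

# outFile is a hypothetical path; replace it with a real location as needed
outFile <- tempfile(fileext = ".csv")
write.csv(stepTable, file = outFile, row.names = FALSE)
```

Connecting the rows of stepTable with straight lines in the target program should then reproduce the staircase shape.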


Russ Lenth



PS - But you could do this just as easily in other software, i.e. just save the x values, import them into Excel, say, and do the equivalent of this using spreadsheet formulas and sorting. – Jul 27 '14 at 19:50
