
I need to read a long file of timestamps in seconds and plot their CDF using numpy or scipy. I tried with numpy, but the output does not seem to be what it is supposed to be. The code is below; any suggestions appreciated.

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('Filename.txt')
sorted_data = np.sort(data)
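# Note: np.cumsum here sums the timestamp values themselves,
# producing a running total rather than a CDF; see the answers for the fix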
cumulative = np.cumsum(sorted_data)

plt.plot(cumulative)
plt.show()

6 Answers


You have two options:

1: You can bin the data first. This is easily done with the numpy.histogram function:

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('Filename.txt')

# Choose how many bins you want here
num_bins = 20

# Use the histogram function to bin the data
# (normed= was deprecated and later removed from numpy; use density= instead)
counts, bin_edges = np.histogram(data, bins=num_bins, density=True)

# Now find the cdf; with density=True the counts integrate to 1,
# so multiply by the bin widths before taking the cumulative sum
cdf = np.cumsum(counts * np.diff(bin_edges))

# And finally plot the cdf
plt.plot(bin_edges[1:], cdf)

plt.show()

2: Rather than using numpy.cumsum, just plot the sorted_data array against the fraction of items smaller than each element in the array (see this answer for more details: https://stackoverflow.com/a/11692365/588071):

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('Filename.txt')
sorted_data = np.sort(data)

# Fraction of the data lying below each sorted value
yvals = np.arange(len(sorted_data)) / float(len(sorted_data) - 1)

plt.plot(sorted_data, yvals)
plt.show()

  • The second code I implemented, but I am a little confused here: x = sorted data and y = yvals, and I am getting a plot that is a straight 90 degree angle. I am seriously confused here, and how do I plot a CCDF now based on this? – Phani.lav Jul 08 '14 at 15:18
  • This works for me. I'm not sure what you mean by 'a plot that is a straight 90 degree angle'. Maybe you could post a sample of your data array? – tmdavison Jul 08 '14 at 17:57
  • I have a data file with two time epochs; I found the difference in seconds and saved those differences (floats, one column) to a separate file. Now I have to find the CDF and CCDF of that time difference. – Phani.lav Jul 08 '14 at 18:27
  • Thanks, everything works fine. I had to put a log scale on and everything looks fine. – Phani.lav Jul 11 '14 at 08:49
  • @Rafnuss a tiny bug, visible only for very small data sets: I believe it should be `float(len(sorted_data)-1)` – UriCS Nov 04 '16 at 03:38
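To address the CCDF question from the comments: the CCDF is just 1 minus the CDF, and the log-scale remark above suggests a minimal sketch along these lines (assuming the same single-column Filename.txt as in the question):

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('Filename.txt')
sorted_data = np.sort(data)

# Fraction of points strictly below each value (empirical CDF)
yvals = np.arange(len(sorted_data)) / float(len(sorted_data))

# CCDF = 1 - CDF; every value stays positive, so a log scale is safe
plt.plot(sorted_data, 1 - yvals)
plt.yscale('log')
plt.xlabel('Value')
plt.ylabel('CCDF')
plt.show()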

For completeness, you should also consider:

  • duplicates: the same point may appear more than once in your data;
  • the points may be unevenly spaced;
  • the points can be floats.

You can use numpy.histogram, setting the bin edges so that each bin collects all the occurrences of exactly one point. Keep density=False, because according to the documentation:

Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen

Instead, normalize the count in each bin by dividing it by the size of your data.

import numpy as np
import matplotlib.pyplot as plt

def cdf(data):

    data_size=len(data)

    # Set bins edges
    data_set=sorted(set(data))
    bins=np.append(data_set, data_set[-1]+1)

    # Use the histogram function to bin the data
    counts, bin_edges = np.histogram(data, bins=bins, density=False)

    counts = counts.astype(float) / data_size

    # Find the cdf
    cdf = np.cumsum(counts)

    # Plot the cdf
    plt.plot(bin_edges[:-1], cdf, linestyle='--', marker='o', color='b')
    plt.ylim((0,1))
    plt.ylabel("CDF")
    plt.grid(True)

    plt.show()

As an example, with the following data:

#[ 0.   0.   0.1  0.1  0.2  0.2  0.3  0.3  0.4  0.4  0.6  0.8  1.   1.2]
data = np.concatenate((np.arange(0,0.5,0.1),np.arange(0.6,1.4,0.2),np.arange(0,0.5,0.1)))
cdf(data)

you would get:

[figure: CDF of the example data, plotted as points joined by a dashed line]


You can also interpolate the CDF to get a continuous function, using either a linear interpolation or a cubic spline:

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d

def cdf(data):

    data_size=len(data)

    # Set bins edges
    data_set=sorted(set(data))
    bins=np.append(data_set, data_set[-1]+1)

    # Use the histogram function to bin the data
    counts, bin_edges = np.histogram(data, bins=bins, density=False)

    counts = counts.astype(float) / data_size

    # Find the cdf
    cdf = np.cumsum(counts)

    x = bin_edges[0:-1]
    y = cdf

    f = interp1d(x, y)
    f2 = interp1d(x, y, kind='cubic')

    xnew = np.linspace(0, max(x), num=1000, endpoint=True)

    # Plot the cdf
    plt.plot(x, y, 'o', xnew, f(xnew), '-', xnew, f2(xnew), '--')
    plt.legend(['data', 'linear', 'cubic'], loc='best')
    plt.title("Interpolation")
    plt.ylim((0,1))
    plt.ylabel("CDF")
    plt.grid(True)

    plt.show()

[figure: the interpolated CDF, showing the data points with linear and cubic fits]


As a quick answer,

plt.plot(sorted_data, np.linspace(0, 1, sorted_data.size))

should get you what you want.
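Spelled out as a complete, minimal sketch (assuming the same Filename.txt as in the question):

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('Filename.txt')
sorted_data = np.sort(data)

# y runs from 0 to 1 in as many evenly spaced steps as there are points
plt.plot(sorted_data, np.linspace(0, 1, sorted_data.size))
plt.show()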


The following are the steps of my implementation:

1. sort your data;

2. calculate the cumulative probability of every x.

import numpy as np
import matplotlib.pyplot as plt

def cdf(data):
    n = len(data)
    x = np.sort(data) # sort your data
    y = np.arange(1, n + 1) / n # calculate cumulative probability
    return x, y

x_data, y_data = cdf(your_data)
plt.plot(x_data, y_data) 

Example:

test_data = np.random.normal(size= 100)
x_data, y_data = cdf(test_data)
plt.plot(x_data, y_data, marker= '.', linestyle= 'none')

Figure: [ECDF of 100 standard normal samples, plotted as unconnected dots]

  • While this code snippet may be the solution, [including an explanation](//meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – Lazar Ljubenović Aug 27 '17 at 14:50

Here's an implementation that's a bit more efficient if there are many repeated values (since we only have to sort the unique values). And it plots the CDF as a step function, which it is, strictly speaking.

import sys

import numpy as np
import matplotlib.pyplot as plt

from collections import Counter


def read_data(fp):
    t = []
    for line in fp:
        x = float(line.rstrip())
        t.append(x)
    return t


def main(script, filename=None):
    if filename is None:
        fp = sys.stdin
    else:
        fp = open(filename)

    t = read_data(fp)
    counter = Counter(t)

    # Sort the unique values (dict views in Python 3 have no .sort() method)
    xs = sorted(counter)

    # Cumulative counts in sorted order, normalized so the CDF ends at 1
    ys = np.cumsum([counter[x] for x in xs]).astype(float)
    ys /= ys[-1]

    options = dict(linewidth=3, alpha=0.5)
    plt.step(xs, ys, where='post', **options)
    plt.xlabel('Values')
    plt.ylabel('CDF')
    plt.show()


if __name__ == '__main__':
    main(*sys.argv)
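Run it as, for example, `python script.py Filename.txt` (the script name here is a placeholder), or pipe the data in on stdin.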

If you want to use the seaborn library, proceed as follows:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('Filename.txt', sep=" ", header=None)
plt.figure()
sns.kdeplot(data[0], cumulative=True)
plt.show()
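Note that kdeplot draws a smoothed (kernel density) estimate of the cumulative distribution rather than the empirical step function; newer seaborn releases (0.11+) also offer sns.ecdfplot(data[0]), which plots the empirical CDF directly.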