How to plot several cumulative distribution functions from data in a CSV file in Python?

Question

I am trying to create a Python script that reads a CSV file in which the sample names are in the first row and the data for each sample are listed below its name, like this:

sample1,sample2,sample3
343.323,234.123,312.544

From this dataset I am trying to draw the cumulative distribution function of each sample on the same axes, using the code below:

import matplotlib.pyplot as plt
import numpy as np
import csv


def isfloat(value):
    '''make sure sample values are floats
    (problem with different number of values per sample)'''
    try:
      float(value)
      return True
    except ValueError:
      return False

def createCDFs (dataset):
    '''create a dictionary with sample name as key and data for each
    sample as one list per key'''
    dataset = dataset
    num_headers = len(list(dataset))
    dict_CDF = {}
    for a in dataset.keys():
        dict_CDF["{}".format(a)]= 1. * np.arange(len(dataset[a])) / (len(dataset[a]) - 1)
    return dict_CDF

def getdata ():
    '''retrieve data from a CSV file - file must have sample names in first row
    and data below'''

    with open('file.csv') as csvfile:
        reader = csv.DictReader(csvfile, delimiter = ',' )
        #create a dict that has sample names as key and associated ages as lists
        dataset = {}
        for row in reader:
            for column, value in row.iteritems():
                if isfloat(value):
                    dataset.setdefault(column, []).append(value)
                else:
                    break
        return dataset

x = getdata()
y = createCDFs(x)

#plot data
for i in x.keys():
    ax1 = plt.subplot(1,1,1)
    ax1.plot(x[i],y[i],label=str(i))


plt.legend(loc='upper left')
plt.show()

This gives the output below, which only properly displays one of the samples (Sample1 in Figure 1A).

Figure 1. (A) Only one CDF is displayed correctly (Sample1). (B) Expected output.

The number of values per sample differs, and I think this is where my problem lies.

This has been really bugging me as I think the solution should be rather simple. Any help/suggestions would be helpful. I simply want to know how I display the data correctly. Data can be found here. The expected output is shown in Figure 1B.

I have added an image of the expected output as generated in Excel — Ton, Nov 07 '16 at 19:54
I still only see the previous image, should there be more than one link? — user2699, Nov 07 '16 at 19:57
I'm guessing you aren't reading all the data from the CSV, check x.shape and see that it is what you expect. — user2699, Nov 07 '16 at 20:22
I am reading all the data from the CSV; I checked it as I went along. The problem is not in importing the data, although I seem to be importing white space for the two samples that contain less data than the other sample, which could be part of the problem. It could also be in the `createCDFs` function or in the way that I am plotting the data. — Ton, Nov 07 '16 at 23:28
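
The situation described in the last comment can be reproduced with a small sketch. It assumes that the shorter samples leave empty cells in the CSV; the inline `CSV_TEXT` and the helpers `read_samples` and `cdf_heights` are hypothetical names used only for illustration. Compared with the question's code, it converts each cell to `float` (the question stores the strings), skips a blank cell instead of using `break` (which abandons the rest of the row), and sorts each sample before pairing it with the CDF heights.

```python
import csv
import io

import numpy as np

# Hypothetical inline data standing in for file.csv; sample2 and sample3
# have fewer values than sample1, so their trailing cells are empty.
CSV_TEXT = """sample1,sample2,sample3
343.3,234.1,312.5
120.0,500.2,
410.7,,
"""


def read_samples(csvfile):
    """Return {sample name: sorted list of floats}, skipping blank cells."""
    samples = {}
    for row in csv.DictReader(csvfile):
        for name, cell in row.items():
            cell = (cell or "").strip()
            if cell:  # skip empty cells instead of abandoning the row
                samples.setdefault(name, []).append(float(cell))
    return {name: sorted(values) for name, values in samples.items()}


def cdf_heights(n):
    """Heights from 0 to 1 for n sorted values (same spacing as createCDFs)."""
    return np.linspace(0.0, 1.0, n)


samples = read_samples(io.StringIO(CSV_TEXT))
for name, values in samples.items():
    print(name, values, cdf_heights(len(values)))
```

Each sample can then be drawn with `ax1.plot(values, cdf_heights(len(values)), label=name)`, so the x and y arrays always have the same length regardless of how many values a sample has.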

score 0 · Accepted Answer · edited May 23 '17 at 12:01

Here is a simpler approach, provided you are willing to use pandas. I used it to calculate the cumulative distributions:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# read the CSV; empty cells in the shorter columns are read as NaN
data_req = pd.read_csv("yourfilepath")

# plot with matplotlib
# note that you have to drop the NaNs and sort each column on its own to
# have appropriate dimensions per variable. Sorting inside
# data_req.apply(lambda x: x.sort_values()) is realigned on the original
# index, so it would leave the rows in their original order.
for col in data_req.columns:
    values = data_req[col].dropna().sort_values()
    y = np.linspace(0., 1., len(values))
    plt.plot(values, y, label=col)

plt.legend(loc='upper left')
plt.show()

In the end, I got the figure you were looking for.
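
As a side note on the heights: `np.linspace(0., 1., n)` gives the smallest value a height of 0 and the largest a height of 1. A common alternative convention for an empirical CDF is to use i/n for the i-th sorted value. The sketch below shows that variant; the function name `ecdf` and the sample numbers are hypothetical, and it is an illustration rather than part of the original answer.

```python
import numpy as np


def ecdf(values):
    """Return sorted x and heights i/n (i = 1..n), ignoring NaNs."""
    x = np.sort(np.asarray(values, dtype=float))  # NaNs are sorted to the end
    x = x[~np.isnan(x)]                           # drop them
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


x, y = ecdf([343.3, np.nan, 120.0, 410.7, 250.0])
print(x)
print(y)
```

With this convention the curve is usually drawn as steps, for example with `plt.step(x, y, where="post")`.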

answered Nov 08 '16 at 11:10 by josecoto (edited May 23 '17 at 12:01 by Community)

Great! Thanks very much. This works well EXCEPT the data sorting does not seem to be working. I added an unsorted sample to the CSV and your code did not sort the added sample. But it worked once I sorted the original data. Any ideas? – Ton Nov 08 '16 at 11:57
Also, I'm fairly noob when it comes to Python, so I didn't actually know about the pandas package. Thank you for that! – Ton Nov 08 '16 at 11:58
I found the solution to the sorting problem. I replaced your sorting code with the following: `arr = data_req.values` `arr.sort(axis=0)` `data_req = pd.DataFrame(arr, index=data_req.index, columns=data_req.columns)` – Ton Nov 08 '16 at 12:43
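
The sorting problem reported in these comments, and the workaround in the last one, can be reproduced with a short sketch. The inline `CSV_TEXT` and the names `via_apply` and `via_numpy` are hypothetical. It uses `np.sort` on a copy of the values instead of the in-place `arr.sort(axis=0)` from the comment, which is an assumption made to avoid modifying the frame's underlying data.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical unsorted data; sample2 is shorter, so its last cell is NaN.
CSV_TEXT = """sample1,sample2
3.0,1.0
1.0,9.0
2.0,
"""

data_req = pd.read_csv(io.StringIO(CSV_TEXT))

# sort_values inside apply: each column comes back with its original index
# labels, and pandas realigns the columns on that index, so the rows end up
# in their original order.
via_apply = data_req.apply(lambda x: x.sort_values())

# Sorting the underlying array ignores the index; NaNs end up last.
arr = np.sort(data_req.to_numpy(dtype=float), axis=0)
via_numpy = pd.DataFrame(arr, index=data_req.index, columns=data_req.columns)

print(via_apply)
print(via_numpy)
```

Because the NaNs are moved to the end of each column, a later `dropna()` per column still leaves the remaining values in sorted order.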
