I am trying to write a Python script that reads a CSV file with sample names in the first row and the data for each sample in the column below its name, like this:
sample1,sample2,sample3
343.323,234.123,312.544
From this dataset I am trying to plot a cumulative distribution function (CDF) for each sample on the same axes, using the code below:
import matplotlib.pyplot as plt
import numpy as np
import csv

def isfloat(value):
    '''make sure sample values are floats
    (problem with different number of values per sample)'''
    try:
        float(value)
        return True
    except ValueError:
        return False

def createCDFs(dataset):
    '''create a dictionary with sample name as key and data for each
    sample as one list per key'''
    dict_CDF = {}
    for a in dataset.keys():
        # evenly spaced cumulative fractions from 0 to 1, one per value
        dict_CDF["{}".format(a)] = 1. * np.arange(len(dataset[a])) / (len(dataset[a]) - 1)
    return dict_CDF

def getdata():
    '''retrieve data from a CSV file - file must have sample names in first row
    and data below'''
    with open('file.csv') as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        # create a dict that has sample names as key and associated ages as lists
        dataset = {}
        for row in reader:
            for column, value in row.iteritems():
                if isfloat(value):
                    dataset.setdefault(column, []).append(value)
                else:
                    break
    return dataset

x = getdata()
y = createCDFs(x)

# plot data
for i in x.keys():
    ax1 = plt.subplot(1, 1, 1)
    ax1.plot(x[i], y[i], label=str(i))
plt.legend(loc='upper left')
plt.show()
This gives the output below, which only properly displays one of the samples (Sample1 in Figure 1A).
Figure 1. (A) Only one CDF (Sample1) displays correctly. (B) Expected output.
The number of values per sample differs, and I think this is where my problem lies.
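To make that suspicion concrete, here is a minimal sketch of one way columns of unequal length could be handled. It rests on some assumptions: that the shorter samples show up as blank cells in the CSV, and that Python 3 is used. In the code above, `break` abandons the rest of a row as soon as one blank cell is met, and the values are stored as unsorted strings. The sketch instead skips only the blank cell, converts each value to `float`, and sorts each column before computing the cumulative fractions. The helper names `read_columns` and `empirical_cdf` and the inline data are hypothetical, for illustration only.

```python
import csv
import io


def read_columns(csvfile):
    """Return {sample name: sorted list of floats}, skipping blank cells."""
    columns = {}
    for row in csv.DictReader(csvfile):
        for name, value in row.items():
            try:
                number = float(value)
            except (TypeError, ValueError):
                # Blank or missing cell: skip this cell only and keep
                # reading the rest of the row (continue, not break).
                continue
            columns.setdefault(name, []).append(number)
    # The x values of an empirical CDF must be in ascending order.
    return {name: sorted(values) for name, values in columns.items()}


def empirical_cdf(values):
    """Cumulative fractions from 0 to 1, one per sorted value."""
    n = len(values)
    if n < 2:
        return [1.0] * n
    return [float(i) / (n - 1) for i in range(n)]


# Hypothetical data: sample2 has fewer values than the other samples.
text = "sample1,sample2,sample3\n3.0,2.0,9.0\n1.0,,7.0\n2.0,,8.0\n"
data = read_columns(io.StringIO(text))
cdfs = {name: empirical_cdf(values) for name, values in data.items()}
```

Each pair `data[name]`, `cdfs[name]` could then be passed to `ax1.plot` in the plotting loop in place of `x[i]`, `y[i]`. Whether this removes the problem in Figure 1A depends on how the real file encodes the missing values.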
This has been really bugging me, as I think the solution should be rather simple. Any help or suggestions would be appreciated; I simply want to know how to display the data correctly. Data can be found here. The expected output is shown in Figure 1B.