I'm very new to Python and am plotting a graph with matplotlib using values from a CSV file. I'm trying to figure out the most efficient way to remove outliers from my lists. The CSV has three variables, x, y, z, which I've put into separate lists.
I want to find the standard deviation of each list and remove every point that lies more than 2x stdev above or below the mean (removing the point from each list, x, y, z, not just one list).
I'm having a hard time figuring out how to efficiently remove a point that is represented in three separate lists while making sure that I don't mix up different data points.
Do I use a while loop and delete the value at a certain position for each variable? If so, how would I reference the position in the list where the number is larger than 2x stdev? Thanks!
import matplotlib.pyplot as plt
import csv
import statistics as stat
#making list of each variable
x = []
y = []
z = []
with open('fundata.csv', 'r') as csvfile:
    plots = csv.reader(csvfile, delimiter=',')
    #skip the header line in CSV
    next(plots)
    #import each variable from the CSV file into a list as a float
    for row in plots:
        x.append(float(row[0]))
        y.append(float(row[1]))
        z.append(float(row[2]))
#cleaning up the data
stdev_x = stat.stdev(x)
stdev_y = stat.stdev(y)
stdev_z = stat.stdev(z)
print(stdev_x)
print(stdev_y)
print(stdev_z)
#making the graph
fig, ax = plt.subplots()
#make a scatter plot graphing x by y with z variable as color, size of each point is 3
ax.scatter(x, y, c=z, s=3)
#Set chart title and label the axes
ax.set_title("Heatmap of variables", fontsize = 18)
ax.set_xlabel("Var 1", fontsize = 14)
ax.set_ylabel("Var 2", fontsize = 14)
#open Matplotlib viewer
plt.show()
Data set is as follows but is ~35000 rows long with more variability:
var1 | var2 | var3 |
---|---|---|
3876514 | 3875931 | 3875846 |
3876515 | 3875931 | 3875846 |
3876516 | 3875931 | 3875846 |
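
To make the filtering step described above concrete, here is a minimal sketch of one possible approach. It assumes "outlier" means a value more than 2 standard deviations from its own list's mean, and that a row is dropped from all three lists if any one of its values is an outlier. The helper name `remove_outliers` and its `n_std` parameter are hypothetical, introduced only for illustration. Instead of deleting by position in a while loop, it zips the lists into rows and rebuilds them, so the three lists cannot get out of step.

```python
import statistics as stat


def remove_outliers(x, y, z, n_std=2):
    """Return copies of x, y, z keeping only rows where every value lies
    within n_std standard deviations of its own list's mean."""
    columns = (x, y, z)
    means = [stat.mean(col) for col in columns]
    stdevs = [stat.stdev(col) for col in columns]

    # zip() pairs up the values that belong to the same data point,
    # so a row is kept or dropped as a whole.
    kept = [
        row for row in zip(x, y, z)
        if all(abs(v - m) <= n_std * s for v, m, s in zip(row, means, stdevs))
    ]

    if not kept:
        return [], [], []

    # Unzip the surviving rows back into three separate lists.
    new_x, new_y, new_z = (list(col) for col in zip(*kept))
    return new_x, new_y, new_z
```

The returned lists could then be passed to `ax.scatter` in place of the originals. Note that the means and standard deviations are computed once, before any rows are removed, so this is a single filtering pass rather than an iterative one.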