
I'm very new to Python. I'm plotting a graph with matplotlib from values in a CSV, and I'm trying to figure out the most efficient way to remove outliers from my lists. The CSV has three variables, x, y, z, which I've read into three separate lists.

I want to find the standard deviation of each list and remove every point that lies more than 2 standard deviations from the mean (removing the point from all three lists, x, y, z, not just one).

I'm having a hard time figuring out how to efficiently remove a point that is represented in three separate lists while making sure that I don't mix up different data points.

Do I use a while loop and delete the value at a certain position for each variable? If so, how would I reference the position in the list where the number is more than 2 standard deviations out? Thanks!

import matplotlib.pyplot as plt
import csv
import statistics as stat

#making list of each variable
x = []
y = []
z = []

with open('fundata.csv', 'r') as csvfile:
    plots = csv.reader(csvfile, delimiter=',')

    #skip the header line in CSV
    next(plots)

    #read each variable from the CSV file into its list as a float
    for row in plots:
        x.append(float(row[0]))
        y.append(float(row[1]))
        z.append(float(row[2]))


#cleaning up the data
stdev_x = stat.stdev(x)
stdev_y = stat.stdev(y)
stdev_z = stat.stdev(z)

print(stdev_x)
print(stdev_y)
print(stdev_z)



#making the graph
fig, ax = plt.subplots()

#make a scatter plot graphing x by y with z variable as color, size of each point is 3
ax.scatter(x, y, c=z, s=3)

#Set chart title and label the axes
ax.set_title("Heatmap of variables", fontsize = 18)
ax.set_xlabel("Var 1", fontsize = 14)
ax.set_ylabel("Var 2", fontsize = 14)

#open Matplotlib viewer
plt.show()

The data set is as follows, but is ~35,000 rows long with more variability:

var1 var2 var3
3876514 3875931 3875846
3876515 3875931 3875846
3876516 3875931 3875846
MaartenDev

1 Answer

It is nearly always easier to use pandas to deal with data of this kind. Calculate the column-wise means and standard deviations, then select values within the required range. The outliers will be replaced with missing values (NaN). You can then use dropna to drop all the rows that contain missing values.

import pandas as pd

# header=0 replaces the header line already present in the file
# with the names given here
df = pd.read_csv("fundata.csv", header=0, names=["x", "y", "z"])
mean = df.mean(axis=0)
std = df.std(axis=0)
edited = df[(mean - 2 * std <= df) & (df <= mean + 2 * std)].dropna()
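To see what this does, here is a minimal sketch with made-up numbers standing in for fundata.csv (the last row's z value is an obvious outlier; everything else is well-behaved):

```python
import pandas as pd

# Toy data: the last row's z value is an obvious outlier.
df = pd.DataFrame({
    "x": range(11),
    "y": range(11),
    "z": list(range(10)) + [1000],
})

mean = df.mean(axis=0)
std = df.std(axis=0)

# Values outside mean +/- 2*std become NaN; dropna() then removes
# any row containing a NaN, i.e. any row with at least one outlier.
edited = df[(mean - 2 * std <= df) & (df <= mean + 2 * std)].dropna()

print(len(df), len(edited))  # 11 rows before, 10 after
```

Note that the comparison between a Series (mean, std) and the DataFrame aligns on column names, so each column is tested against its own mean and standard deviation.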

Alternatively, use scipy.stats.zscore, which will do the calculation for you:

from scipy.stats import zscore
...
edited = df[(abs(zscore(df)) <= 2).all(axis=1)]

If you want to avoid pandas for some reason, then one way would be to replace all the outliers within each column with None:

import statistics

def replace_outliers(values):
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    for v in values:
        # keep values within 2 standard deviations of the mean
        if mean - 2 * stdev <= v <= mean + 2 * stdev:
            yield v
        else:
            yield None

x, y, z = [replace_outliers(column) for column in [x, y, z]]

Then zip the columns together and select rows that do not contain None:

selected_rows = [row for row in zip(x, y, z) if None not in row]

Finally, if needed, you can zip the rows together to transpose the data back into three column lists:

x, y, z = zip(*selected_rows)
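
Putting those pure-Python steps together on a small made-up sample (hypothetical values, not the real fundata.csv):

```python
import statistics

def replace_outliers(values):
    """Yield each value, or None if it is more than 2 stdevs from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    for v in values:
        if mean - 2 * stdev <= v <= mean + 2 * stdev:
            yield v
        else:
            yield None

# Made-up columns: the last row's z value is an obvious outlier.
x = list(range(11))
y = list(range(11))
z = list(range(10)) + [1000]

x, y, z = [replace_outliers(column) for column in [x, y, z]]
selected_rows = [row for row in zip(x, y, z) if None not in row]
x, y, z = zip(*selected_rows)

print(len(selected_rows))  # 10: the outlier row has been dropped
```

Because replace_outliers is a generator, the three columns are only consumed once, by the single zip pass that builds selected_rows.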
Stuart