I have a csv file of fish occurrences and need to trim out any fish that show up only once, and then output this as a 'trimmed' csv. However, the function I am using adds a headerless column to the trimmed csv, which messes up further calculations I need to do with the trimmed file.
The column contains row numbers from to_keep, and I believe it is created as a result of this line: return df[df[colname].isin(to_keep)]. I would like this script simply not to create this column; otherwise I have to manually delete it from every single csv file I trim!
import pandas as pd

def trim_single_entries(fn, colname):
    # remove all entries where colname's entry is unique to one row across the whole file
    df = pd.read_csv(fn)
    if colname in df.columns:
        counts = df[colname].value_counts()
        to_keep = [counts.index[i] for i in range(0,len(counts)) if counts.values[i] > 1]
        return df[df[colname].isin(to_keep)]
    else:
        return False

x = trim_single_entries('fish_data.csv', 'catalognumber')
x.to_csv('trimmed_fish_data.csv')
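
A likely explanation, offered as a sketch rather than a confirmed diagnosis: the headerless column is probably the DataFrame's index, which to_csv writes by default, rather than something the isin filter adds. The filtered frame keeps the original row labels of the surviving rows, which would explain why the column looks like row numbers. Passing index=False to to_csv should leave it out. The sample data below is hypothetical and only stands in for fish_data.csv; the species column is an assumption for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for fish_data.csv
sample = io.StringIO(
    "catalognumber,species\n"
    "A1,trout\n"
    "B2,pike\n"
    "A1,trout\n"
    "C3,perch\n"
    "B2,pike\n"
)
df = pd.read_csv(sample)

# Keep only rows whose catalognumber appears more than once
counts = df["catalognumber"].value_counts()
trimmed = df[df["catalognumber"].isin(counts[counts > 1].index)]

# Default behaviour: the index is written as an unnamed first column
with_index = trimmed.to_csv()

# index=False: only the data columns are written
without_index = trimmed.to_csv(index=False)

print(with_index.splitlines()[0])     # ,catalognumber,species
print(without_index.splitlines()[0])  # catalognumber,species
```

Applied to the script in the question, that would mean changing the last line to x.to_csv('trimmed_fish_data.csv', index=False).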