I am working on a program that partitions a dataset with pandas. This question does not answer my question. The program uses segmentation by natural partitioning. The goal is to:
- calculate the 5th percentile
- calculate the 95th percentile
- sort the data
- partition the dataset such that only the values from floor(n*0.05) to floor(n*0.95) remain.
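To make the intended result concrete, here is a minimal sketch of the last two steps as I understand them, assuming that floor(n*0.05) and floor(n*0.95) are positions in the sorted data. The helper name `trim_by_position` and the sample values are hypothetical and only for illustration.

```python
import math

import pandas as pd


def trim_by_position(values: pd.Series, lower: float = 0.05, upper: float = 0.95) -> pd.Series:
    """Sort the values, then keep positions floor(n*lower) up to (not including) floor(n*upper)."""
    n = len(values)
    # Sort ascending and reset the index so positions and labels agree.
    ordered = values.sort_values(ignore_index=True)
    start = math.floor(n * lower)
    stop = math.floor(n * upper)
    return ordered.iloc[start:stop]


# Hypothetical data: the integers 1..20 in reverse order.
sample = pd.Series(range(20, 0, -1))
trimmed = trim_by_position(sample)
print(trimmed.tolist())
```

Whether the upper position should be included or excluded is an assumption on my part; the slice above excludes it.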
I've written a method that processes the data. Previously, I was using:
def segmentation_by_natural_partitioning(attribute):
    print(attribute.head())
    a = np.array(attribute)
    # calculate 5th and 95th percentiles.
    fith_percentile = np.percentile(a, 5)
    nienty_fith_percentile = np.percentile(a, 95)
    # sort the data.
    sorted_data = np.sort(a)
    n = a.size
    # keep the values from floor(n*0.05) to floor(n*0.95)
    new_a = split(a, (a > np.math.floor(n*fith_percentile)) & (a < np.math.floor(n*nienty_fith_percentile)))
    return attribute
I'd like to replace
new_a = split(a, (a > np.math.floor(n*fith_percentile)) & (a < np.math.floor(n*nienty_fith_percentile)))
with
s = s[(s['A2'] > np.math.floor(n*fith_percentile)) and (s['A2'] <= np.math.floor(n*nienty_fith_percentile))]
The full program is written like so
from numpy.core.defchararray import count
import pandas as pd
import numpy as np
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')

def main():
    s = pd.read_csv('A1-dm.csv')
    # entropy_discretization(df['A1'])
    segmentation_by_natural_partitioning(s)

# This method discretizes attribute A1
# If the information gain is 0, i.e the number of
# distinct class is 1 or
# If min f/ max f < 0.5 and the number of distinct values is floor(n/2)
# Then that partition stops splitting.
def entropy_discretization(s):
    # pick a threshold
    threshold = 6
    print(segmentation_by_natural_partitioning(s))
    print(s.head())

def segmentation_by_natural_partitioning(s):
    a = np.array(s)
    # calculate 5th and 95th percentiles.
    fith_percentile = np.percentile(a, 5)
    nienty_fith_percentile = np.percentile(a, 95)
    # sort the data.
    sorted_data = np.sort(a)
    n = a.size
    # keep the values from floor(n*0.05) to floor(n*0.95)
    s = s[(s['A2'] > np.math.floor(n*fith_percentile)) and (s['A2'] <= np.math.floor(n*nienty_fith_percentile))]
    return s

main()
A sample of the dataset is provided here
A1,A2,A3,Class
2,0.4631338,1.5,3
8,0.7460648,3.0,3
6,0.264391038,2.5,2
5,0.4406713,2.3,1
2,0.410438159,1.5,3
2,0.302901816,1.5,2
6,0.275869396,2.5,3
8,0.084782428,3.0,3
When I try to run my code I get the following error:
f"The truth value of a {type(self).__name__} is ambiguous. "
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
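The error can be reproduced in isolation. Below is a minimal sketch using the A2 column from the sample rows above. It assumes the 5th and 95th percentile values themselves are used as bounds, which differs from my program (there they are multiplied by n). Python's `and` asks each Series for a single truth value, which pandas refuses to give, while the elementwise `&` operator combines the two comparisons row by row.

```python
import pandas as pd

# A2 values copied from the sample rows of the dataset.
df = pd.DataFrame({"A2": [0.4631338, 0.7460648, 0.264391038, 0.4406713,
                          0.410438159, 0.302901816, 0.275869396, 0.084782428]})

# Assumption: use the percentile values directly as the bounds.
low = df["A2"].quantile(0.05)
high = df["A2"].quantile(0.95)

# `and` needs one bool per operand, so pandas raises ValueError here.
try:
    df[(df["A2"] > low) and (df["A2"] <= high)]
    raised = False
    message = ""
except ValueError as exc:
    raised = True
    message = str(exc)

# `&` works elementwise; each comparison must be wrapped in parentheses.
mask = (df["A2"] > low) & (df["A2"] <= high)
filtered = df[mask]
print(filtered)
```

With these bounds the smallest and largest A2 values fall outside the mask and the remaining rows are kept.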
I am specifically looking for a way to partition the dataset via pandas.