I am working on a program that partitions a dataset with pandas. A related question does not answer mine. The program uses segmentation by natural partitioning. The goal is to:

  1. calculate the 5th percentile
  2. calculate the 95th percentile
  3. sort the data
  4. partition the dataset so that only the values from floor(n*0.05) to floor(n*0.95) remain.
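
Read literally, the steps above describe a positional trim of the sorted data. A minimal sketch of that reading follows, assuming the upper bound floor(n*0.95) is exclusive; the helper name `trim_by_position` and the 20-row sample are hypothetical and not part of the original program.

```python
import math

import pandas as pd


def trim_by_position(df, column, lower=0.05, upper=0.95):
    """Sort by `column`, then keep positions floor(n*lower) up to floor(n*upper) - 1."""
    ordered = df.sort_values(column).reset_index(drop=True)
    n = len(ordered)
    lo = math.floor(n * lower)
    hi = math.floor(n * upper)  # assumption: treated as an exclusive bound
    return ordered.iloc[lo:hi]


# Hypothetical data: 20 rows with A2 = 0.19, 0.18, ..., 0.00 (unsorted on purpose).
sample = pd.DataFrame({"A2": [i / 100 for i in range(19, -1, -1)]})
trimmed = trim_by_position(sample, "A2")
print(len(trimmed), trimmed["A2"].min(), trimmed["A2"].max())
```

With 20 rows this drops the single smallest and single largest value. Note that the positions come from n, while the percentile values themselves are never multiplied by n.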

I've written a method that processes the data. Previously, I was using:

def segmentation_by_natural_partitioning(attribute):
    print(attribute.head())
    a = np.array(attribute)

    # calculate 5th and 95th percentiles.
    fith_percentile = np.percentile(a, 5)
    nienty_fith_percentile = np.percentile(a, 95) 

    # sort the data.
    sorted_data = np.sort(a)
    n = a.size
    # keep the values from floor(n*0.05) to floor(n*0.95)
    new_a = split(a, (a > np.math.floor(n*fith_percentile)) & (a < np.math.floor(n*nienty_fith_percentile)))
    return attribute

I'd like to replace

    new_a = split(a, (a > np.math.floor(n*fith_percentile)) & (a < np.math.floor(n*nienty_fith_percentile)))

with

    s = s[(s['A2'] > np.math.floor(n*fith_percentile)) and (s['A2'] <= np.math.floor(n*nienty_fith_percentile))]

The full program looks like this:

from numpy.core.defchararray import count
import pandas as pd
import numpy as np


def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')

def main():
    s = pd.read_csv('A1-dm.csv')
    # entropy_discretization(df['A1'])
    segmentation_by_natural_partitioning(s)

# This method discretizes attribute A1
# If the information gain is 0, i.e the number of 
# distinct class is 1 or
# If min f/ max f < 0.5 and the number of distinct values is floor(n/2)
# Then that partition stops splitting.
def entropy_discretization(s):
    # pick a threshold
    threshold = 6
    print(segmentation_by_natural_partitioning(s))
    print(s.head())


def segmentation_by_natural_partitioning(s):
    a = np.array(s)

    # calculate 5th and 95th percentiles.
    fith_percentile = np.percentile(a, 5)
    nienty_fith_percentile = np.percentile(a, 95) 

    # sort the data.
    sorted_data = np.sort(a)
    n = a.size
    # keep the values from floor(n*0.05) to floor(n*0.95)
    s = s[(s['A2'] > np.math.floor(n*fith_percentile)) and (s['A2'] <= np.math.floor(n*nienty_fith_percentile))]

    return s


main()

A sample of the dataset is provided here

A1,A2,A3,Class
2,0.4631338,1.5,3
8,0.7460648,3.0,3
6,0.264391038,2.5,2
5,0.4406713,2.3,1
2,0.410438159,1.5,3
2,0.302901816,1.5,2
6,0.275869396,2.5,3
8,0.084782428,3.0,3

When I try to run my code I get the following error:

f"The truth value of a {type(self).__name__} is ambiguous. "
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
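
For reference, the error can be reproduced in isolation. The sketch below uses a small hypothetical Series rather than the real dataset: Python's `and` asks for a single truth value from the left-hand Series, which pandas refuses to give, while `&` combines the two comparisons element by element.

```python
import pandas as pd

values = pd.Series([0.1, 0.5, 0.9])  # hypothetical values, not the real dataset

try:
    # `and` calls bool() on the left-hand Series, which raises ValueError.
    mask = (values > 0.2) and (values <= 0.8)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# `&` performs an element-wise logical AND on the two boolean Series.
mask = (values > 0.2) & (values <= 0.8)
print(mask.tolist())
```

The parentheses around each comparison matter, because `&` binds more tightly than `>` and `<=`.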

I am specifically looking for a way to partition the dataset via pandas.

halfer
Evan Gertis

1 Answer

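The answer was simple. Python's "and" operator cannot combine two boolean Series, which is what raises the ValueError. I just needed to filter the dataframe in two steps, comparing against the percentile values directly: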

 s = s[s['A2'] > fith_percentile]
 s = s[s['A2'] < nienty_fith_percentile]
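
The same two filters can also be combined into one mask with `&`. This is a sketch rather than the code from the program above: the helper name `keep_between_percentiles` is hypothetical, and it uses `Series.quantile` in place of `np.percentile` (both default to linear interpolation). The data is the A2 column from the sample in the question.

```python
import pandas as pd


def keep_between_percentiles(df, column, lower=0.05, upper=0.95):
    """Keep rows whose `column` value lies strictly between the two quantiles."""
    low_value = df[column].quantile(lower)
    high_value = df[column].quantile(upper)
    # Element-wise AND; each comparison needs its own parentheses.
    mask = (df[column] > low_value) & (df[column] < high_value)
    return df[mask]


# A2 values copied from the sample dataset in the question.
s = pd.DataFrame({"A2": [0.4631338, 0.7460648, 0.264391038, 0.4406713,
                         0.410438159, 0.302901816, 0.275869396, 0.084782428]})
filtered = keep_between_percentiles(s, "A2")
print(filtered)
```

On these eight rows the filter drops the smallest and the largest A2 value and keeps the other six.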
Evan Gertis