How to conditionally partition a pandas dataframe

Question

I am working on a program that partitions a dataset with pandas. This question does not answer my question. The program uses segmentation by natural partitioning. The goal is to:

calculate the 5th percentile
calculate the 95th percentile
sort the data
partition the dataset such that only the values between positions floor(n*0.05) and floor(n*0.95) of the sorted data remain.
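One possible reading of these steps, as a minimal sketch rather than a definitive implementation: sort the column, then slice it by position. The names `values`, `lo_idx`, `hi_idx`, and `kept` are hypothetical, the data is the A2 column from the sample further down, and treating the upper bound as exclusive is an assumption.

```python
import math

import pandas as pd

# A2 values from the sample dataset in this question.
values = pd.Series(
    [0.4631338, 0.7460648, 0.264391038, 0.4406713,
     0.410438159, 0.302901816, 0.275869396, 0.084782428],
    name="A2",
)

n = len(values)
lo_idx = math.floor(n * 0.05)  # first position to keep
hi_idx = math.floor(n * 0.95)  # first position to drop (assumed exclusive)

# Sort ascending and renumber the index so positions match sorted order.
ordered = values.sort_values(ignore_index=True)

# Keep only the positions in the half-open range [lo_idx, hi_idx).
kept = ordered.iloc[lo_idx:hi_idx]
print(kept)
```

Note that this slices by position in the sorted data, which is not the same as comparing each value against floor(n * percentile).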

I've written a method that processes the data. Previously, I was using:

def segmentation_by_natural_partitioning(attribute):
    print(attribute.head())
    a = np.array(attribute)

    # calculate 5th and 95th percentiles.
    fith_percentile = np.percentile(a, 5)
    nienty_fith_percentile = np.percentile(a, 95) 

    # sort the data.
    sorted_data = np.sort(a)
    n = a.size
    # keep the values from floor(n*0.05) to floor(n*0.95)
    new_a = split(a, (a > np.math.floor(n*fith_percentile)) & (a < np.math.floor(n*nienty_fith_percentile)))
    return attribute

I'd like to replace

    new_a = split(a, (a > np.math.floor(n*fith_percentile)) & (a < np.math.floor(n*nienty_fith_percentile)))

with

s = s[(s['A2'] > np.math.floor(n*fith_percentile)) and (s['A2'] <= np.math.floor(n*nienty_fith_percentile))]

The full program is written like so

from numpy.core.defchararray import count
import pandas as pd
import numpy as np
import numpy as np


def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')

def main():
    s = pd.read_csv('A1-dm.csv')
    # entropy_discretization(df['A1'])
    segmentation_by_natural_partitioning(s)

# This method discretizes attribute A1
# If the information gain is 0, i.e the number of 
# distinct class is 1 or
# If min f/ max f < 0.5 and the number of distinct values is floor(n/2)
# Then that partition stops splitting.
def entropy_discretization(s):
    # pick a threshold
    threshold = 6
    print(segmentation_by_natural_partitioning(s))
    print(s.head())


def segmentation_by_natural_partitioning(s):
    a = np.array(s)

    # calculate 5th and 95th percentiles.
    fith_percentile = np.percentile(a, 5)
    nienty_fith_percentile = np.percentile(a, 95) 

    # sort the data.
    sorted_data = np.sort(a)
    n = a.size
    # keep the values from floor(n*0.05) to floor(n*0.95)
    s = s[(s['A2'] > np.math.floor(n*fith_percentile)) and (s['A2'] <= np.math.floor(n*nienty_fith_percentile))]

    return s


main()

A sample of the dataset is provided here

A1,A2,A3,Class
2,0.4631338,1.5,3
8,0.7460648,3.0,3
6,0.264391038,2.5,2
5,0.4406713,2.3,1
2,0.410438159,1.5,3
2,0.302901816,1.5,2
6,0.275869396,2.5,3
8,0.084782428,3.0,3

When I try to run my code I get the following error:

f"The truth value of a {type(self).__name__} is ambiguous. "
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I am specifically looking for a way to partition the dataset via pandas.

score 0 · Answer 1 · answered Oct 18 '21 at 00:23

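The answer was simple. Python's "and" tries to reduce each Series comparison to a single True or False, which is what raises the ValueError. I just needed to break up the filter into two steps and compare against the percentile values directly: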

 s = s[s['A2'] > fith_percentile]
 s = s[s['A2'] < nienty_fith_percentile]
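The two filters can also be written as a single mask using the element-wise `&` operator instead of `and`. This is a minimal sketch under some assumptions: the names `df`, `low`, `high`, and `trimmed` are illustrative, the rows come from the sample in the question, and `Series.quantile` is swapped in for `np.percentile` (both use linear interpolation by default).

```python
import pandas as pd

# Rows taken from the sample dataset in the question.
df = pd.DataFrame({
    "A1": [2, 8, 6, 5, 2, 2, 6, 8],
    "A2": [0.4631338, 0.7460648, 0.264391038, 0.4406713,
           0.410438159, 0.302901816, 0.275869396, 0.084782428],
})

# 5th and 95th percentiles of column A2.
low = df["A2"].quantile(0.05)
high = df["A2"].quantile(0.95)

# `&` combines the two boolean Series element by element.
# Each comparison needs its own parentheses because `&` binds
# more tightly than `>` and `<`.
trimmed = df[(df["A2"] > low) & (df["A2"] < high)]
print(trimmed)
```

Unlike `and`, the `&` operator never asks for the truth value of a whole Series, so the ValueError does not occur.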

Evan Gertis
