
Dataset: https://raw.githubusercontent.com/Kuntal-G/Machine-Learning/master/R-machine-learning/data/banknote-authentication.csv

How can I calculate the conditional entropy and find the feature split with the best information gain for a dataset like this? (The CSV holds the four continuous banknote features plus a 0/1 class label in the last column.)

The code for calculating entropy:

def entropy(column):
    """Calculate the Shannon entropy (in bits) of the values in a column."""
    values, counts = np.unique(column, return_counts=True)
    total = counts.sum()
    entropy_val = 0
    for count in counts:
        entropy_val += (-count / total) * math.log2(count / total)
    return entropy_val

where 'column' is a feature in the dataframe, for example df[0]. I'm a little stuck on where to go from here. Can anyone point me in the right direction? My end goal is finding the best information gain.
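
As a quick sanity check of the function, a perfectly balanced binary column should come out at exactly 1 bit:

print(entropy([0, 0, 1, 1]))  # 1.0

Then the per-column entropies: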

entropy_vals = entropy(X[0]), entropy(X[1]), entropy(X[2]), entropy(X[3]), entropy(y)

print(entropy_vals)
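
As I understand it, the quantity I'm after for each feature X is the information gain

IG(Y, X) = H(Y) - H(Y | X)
         = H(Y) - sum_v (|Y_v| / |Y|) * H(Y_v)

where Y is the class label, the split on X produces branches v, Y_v is the subset of labels falling into branch v, and H(Y | X) is the weighted average label entropy across the branches. A sketch that applies this with a mean split follows the full code below.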


The full code so far:

import math

import numpy as np
import pandas as pd

df = pd.read_csv('data_banknote_authentication.txt', header=None)
print(df)


# Last column is the 0/1 class label; the first four columns are the features.
y = df.iloc[:, -1]
X = df.iloc[:, :4]


def count_labels(rows):
    """Count how many times each unique value appears in a column."""
    counts = {}
    for label in rows:
        counts[label] = counts.get(label, 0) + 1
    return counts
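
# Hypothetical usage, as a sanity check: tally the class labels.
# On the banknote data this should print something like {0: 762, 1: 610}.
print(count_labels(y))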


def entropy(column):
    """Calculate the Shannon entropy (in bits) of the values in a column."""
    values, counts = np.unique(column, return_counts=True)
    total = counts.sum()
    entropy_val = 0
    for count in counts:
        entropy_val += (-count / total) * math.log2(count / total)
    return entropy_val


# Entropy of each feature column plus the label column.
entropy_vals = entropy(X[0]), entropy(X[1]), entropy(X[2]), entropy(X[3]), entropy(y)

print(entropy_vals)


def check_unique(data):
    """Return True when only one class label is left in the data."""
    label_col = data[data.columns[-1]]
    unique_features = np.unique(label_col)
    return len(unique_features) == 1


def categorize_data(data):
    """Return the majority class label (used to label a leaf node)."""
    label_col = data[data.columns[-1]]
    values, counts = np.unique(label_col, return_counts=True)
    index = counts.argmax()
    return values[index]



def split(data):
    """Split a feature column at its mean into two branches."""
    x_less = data[data <= np.mean(data)]
    x_greater = data[data > np.mean(data)]
    return x_less, x_greater
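
To connect the pieces, here is a minimal sketch of the step I think is missing, assuming a mean-threshold split on each feature (information_gain and best_feature are hypothetical names of mine, not from any library). It computes the conditional entropy H(y | split) as the weighted average label entropy of the two branches, subtracts it from H(y), and keeps the feature with the largest gain:

def information_gain(feature, labels):
    """Information gain of a mean-threshold split on one feature column."""
    mask = feature <= feature.mean()           # boolean split at the mean
    left, right = labels[mask], labels[~mask]  # label subsets per branch
    # Conditional entropy H(labels | split): weighted average branch entropy.
    conditional = (len(left) / len(labels)) * entropy(left) \
                + (len(right) / len(labels)) * entropy(right)
    return entropy(labels) - conditional


# Score every feature and keep the one with the highest gain.
gains = {col: information_gain(X[col], y) for col in X.columns}
best_feature = max(gains, key=gains.get)
print(gains)
print('Best feature to split on:', best_feature)

Splitting at the mean is only one heuristic; a decision-tree learner would normally try every midpoint between consecutive sorted feature values and keep the threshold with the highest gain.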
Comments:

  • Post the data as reproducible code. Are you trying to calculate entropy over each column? – Zaraki Kenpachi Sep 18 '20 at 11:49
  • Added entropy over each column. I'm now trying to find the information gain from each column. I want to use the mean to split, but I'm not sure how to go on from here. – crissb3 Sep 18 '20 at 12:00
  • How does this entropy need to be calculated? Like above, or with another solution? – Zaraki Kenpachi Sep 18 '20 at 12:11
  • I'm trying to do something like this: https://stackoverflow.com/questions/25462407/fast-information-gain-computation but I'm not sure how to fit my code into that. – crissb3 Sep 18 '20 at 12:18
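
Following up on that last link: if an off-the-shelf routine is acceptable, scikit-learn's mutual_info_classif estimates the mutual information between each feature and the class label, which for a discrete target is exactly the information gain. Note that it uses a nearest-neighbour estimator for continuous features rather than a mean split, so its numbers will differ from the sketch above:

from sklearn.feature_selection import mutual_info_classif

# Estimated information gain (mutual information) of each feature column.
print(mutual_info_classif(X, y))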
