3

I have a dataset that I quantized to 10 levels in Python, and it looks like this:

9 9 1 8 9 1

1 9 3 6 1 0

8 3 8 4 4 1

0 2 1 9 9 0

This means the sample (9 9 1 8 9) belongs to class 1. I want to find the entropy of each feature (column). I wrote the following code, but it has many errors:

import pandas as pd
import math

f = open ( 'data1.txt' , 'r')

# Finding the probability
df = pd.DataFrame(pd.read_csv(f, sep='\t', header=None, names=['val1', 
    'val2', 'val3', 'val4','val5', 'val6', 'val7', 'val8']))
df.loc[:,"val1":"val5"] = df.loc[:,"val1":"val5"].div(df.sum(axis=0), 
    axis=1)

# Calculating Entropy
def shannon(col):
    entropy = - sum([ p * math.log(p) / math.log(2.0) for p in col])
    return entropy

sh_df = df.loc[:,'val1':'val5'].apply(shannon,axis=0)

Can you correct my code or do you know any function for finding the Entropy of each column of a dataset in Python?

Gonzalo Garcia
Amir
    refer this answer please https://stackoverflow.com/questions/15450192/fastest-way-to-compute-entropy-in-python scipy already has formula for entropy – Aritesh Apr 06 '18 at 05:48
  • Please consider [accepting an answer](https://stackoverflow.com/help/someone-answers). If you find [no answer satisfactory](https://stackoverflow.com/help/no-one-answers), please consider editing your question(s) to provide more information. If you want to motivate answerers, please consider [starting a bounty](https://meta.stackexchange.com/questions/16065/). Accepting an answer shows your appreciation, rewards the author, provides incentive to others and informs everyone that your issue is resolved. You can always change your mind and accept a different answer later on. – marianoju Jan 27 '22 at 19:58
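As the comment above notes, `scipy.stats.entropy` already implements this formula. A minimal sketch of that approach, using the first feature column of the sample data from the question (the column names here are just for illustration):

```python
import pandas as pd
from scipy.stats import entropy

# First feature column from the question's sample data
col = pd.Series([9, 1, 8, 0])

# value_counts gives the count of each distinct value;
# scipy normalizes the counts to probabilities internally.
# base=2 gives the entropy in bits (Shannon entropy).
h = entropy(col.value_counts(), base=2)
print(h)  # four equally likely values -> 2.0 bits
```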

1 Answer

6

You can compute the entropy of a column in pandas with the following function:

import numpy as np
import pandas as pd
from math import e

def pandas_entropy(column, base=None):
    """Usage: pandas_entropy(df['column1'])"""
    # Relative frequency (probability) of each distinct value in the column
    vc = pd.Series(column).value_counts(normalize=True, sort=False)
    # Default to the natural log (nats); pass base=2 for bits
    base = e if base is None else base
    # Shannon entropy: -sum(p * log_base(p))
    return -(vc * np.log(vc) / np.log(base)).sum()

Run the function on each column and it will return that column's entropy.
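Applied to the question's sample data (with the last column as the class label, and illustrative column names `val1`..`val5`), running the function over every feature column looks like this:

```python
import numpy as np
import pandas as pd
from math import e

def pandas_entropy(column, base=None):
    vc = pd.Series(column).value_counts(normalize=True, sort=False)
    base = e if base is None else base
    return -(vc * np.log(vc) / np.log(base)).sum()

# Sample data from the question; the last column is the class label
df = pd.DataFrame(
    [[9, 9, 1, 8, 9, 1],
     [1, 9, 3, 6, 1, 0],
     [8, 3, 8, 4, 4, 1],
     [0, 2, 1, 9, 9, 0]],
    columns=['val1', 'val2', 'val3', 'val4', 'val5', 'class'],
)

# Entropy (base 2) of each feature column; extra kwargs to
# DataFrame.apply are forwarded to the function
entropies = df.loc[:, 'val1':'val5'].apply(pandas_entropy, base=2)
print(entropies)
```

For example, `val1` holds four distinct equally likely values, so its entropy is 2.0 bits, while `val2` (one value twice, two values once) gives 1.5 bits.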

This answer was inspired by this one

marianoju
Gonzalo Garcia
  • `scipy.stats.entropy` and `math.log` are not required. What is the definition or unit of your entropy measure? Shannon? Your link points to a question, not any specific answer. – marianoju Jan 27 '22 at 11:03