
I have a CSV file of words and their tf-idf scores, and I am writing a method to normalize the scores so that they fall between 0 and 1. I am using the pandas library for Python, and the data is read into a pandas DataFrame. When I run the code, I get the exception "ValueError: too many boolean indices". Could you please tell me what is going wrong? I went through several answers on multiple forums, but could not relate them to what I am facing.

This is the line where I get the error: `dtm_norm=(dtm-min)/(diffMaxMin)`

This is the data format:

        index                        0
    0   abbaiah                      0.121030858
    1   abbaiah_reddi                0.121030858
    2   abbaiah_reddi_kaggadasapura  0.121030858

This is the code:

import glob
import pandas as pd

def normalizeValues(inputpath):
    outputpath=inputpath+'normalized\\'

    allFiles =  glob.glob(inputpath+"\\*.csv")
    for file in allFiles:
        fileName=file.split('\\')[-1]
        dtm=pd.read_csv(file)
        # numeric_only expects a boolean, not the string 'true'
        min=dtm.min(numeric_only=True)
        max=dtm.max(numeric_only=True)
        diffMaxMin=max-min
        # the ValueError is raised on the next line
        dtm_norm=(dtm-min)/(diffMaxMin)
        # writeToCsv is my own helper (not shown here)
        writeToCsv(dtm_norm,outputpath+fileName)
pnv
  • I don't know why you get that error, but did you look at this related question: http://stackoverflow.com/questions/12525722/normalize-data-in-pandas? There is also a method in sklearn: http://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range – EdChum Mar 18 '15 at 09:44
  • Yes, I wrote the code based on the question you suggested, but I am using min-max normalization. – pnv Mar 18 '15 at 09:45
  • Something I did notice is that you are not filtering your columns by dtype. Your `min`, `max` and `diffMaxMin` are computed on the numeric columns only, but you then subtract `min` from `dtm`, which is your original, unfiltered df. Could this be the problem? – EdChum Mar 18 '15 at 09:49
  • It might be, I am not sure. I will try it out and let you know. Thanks for the suggestion. – pnv Mar 18 '15 at 09:50
  • You could try `dtm_norm=(dtm[min.index]-min)/(diffMaxMin)` so that you select the same columns – EdChum Mar 18 '15 at 09:52
  • Another possibility is that your `dtm` df has all its original rows and somehow your `min` and `max` dfs have a different number of rows. Can you check whether this is the case? – EdChum Mar 18 '15 at 10:51
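
The column-filtering idea from the comments can be sketched as follows. This is only a minimal illustration, not a confirmed diagnosis of the ValueError: the function name `min_max_normalize`, the column names `word` and `score`, and the sample values are all made up for the example. It selects the numeric columns first, so that the subtraction and division only ever touch columns that `min` and `max` were computed on.

```python
import pandas as pd


def min_max_normalize(df):
    """Scale every numeric column of df to the range [0, 1].

    Hypothetical helper for illustration; non-numeric columns
    (such as the words) are passed through unchanged.
    """
    # Work only on the numeric columns, so min/max and the frame
    # being scaled always refer to the same set of columns.
    numeric = df.select_dtypes(include="number")
    col_min = numeric.min()
    col_range = numeric.max() - col_min
    # A constant column has a range of 0; use NaN there so the result
    # is NaN instead of a division by zero.
    col_range = col_range.replace(0, float("nan"))

    out = df.copy()
    out[numeric.columns] = (numeric - col_min) / col_range
    return out


# Made-up sample data (not the values from the question).
dtm = pd.DataFrame({
    "word": ["alpha", "beta", "gamma"],
    "score": [0.25, 0.5, 0.75],
})
normalized = min_max_normalize(dtm)
print(normalized)
```

Note that in the three sample rows shown in the question every score is identical, so for such a file the range is 0 and the normalized column would come out as NaN under this sketch.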

0 Answers