0

I'm using XGBoost for feature importance and want to select the features that account for 90% of the total importance. First I build a DataFrame, because I need it for Excel, and then I use a while loop to pick the features that add up to 90% of the importance. After this there is a neural network (not included in the code below). I know there may be easier ways to do this, but my code gives me an error:

ValueError: could not convert string to float: '0,25691372'

The code is

import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn import preprocessing

from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from matplotlib import pyplot as plt


dataset = pd.read_csv('CompleteDataSet_original_Clean_CONC.csv', decimal=',', delimiter = ";")
from sklearn.metrics import r2_score

label = dataset.iloc[:,-1]
features = dataset.drop(columns = ['Label'])
y_max_pre_normalize = max(label)
y_min_pre_normalize = min(label)

def denormalize(y):
    final_value = y*(y_max_pre_normalize-y_min_pre_normalize)+y_min_pre_normalize
    return final_value
X_train1, X_test1, y_train1, y_test1 = train_test_split(features, label, test_size = 0.20, random_state = 1, shuffle = True)

y_test2 = y_test1.to_frame()
y_train2 = y_train1.to_frame()

scaler1 = preprocessing.MinMaxScaler()
scaler2 = preprocessing.MinMaxScaler()
X_train = scaler1.fit_transform(X_train1)
X_test = scaler2.fit_transform(X_test1)


scaler3 = preprocessing.MinMaxScaler()
scaler4 = preprocessing.MinMaxScaler()
y_train = scaler3.fit_transform(y_train2)
y_test = scaler4.fit_transform(y_test2)


sel = XGBRegressor(colsample_bytree= 0.7, learning_rate = 0.005, max_depth = 5, min_child_weight = 3, n_estimators = 1000)
sel.fit(X_train, y_train)
importances = sel.feature_importances_

importances = [str(i) for i in importances]

importances = [i.replace(".", ",") for i in importances]

df1 = pd.DataFrame(features.columns)
df1.columns = ['Features']
df2 = pd.DataFrame(importances)
df2.columns = ['Importances [%]']
result = pd.concat([df1,df2],axis = 1)
result = result.sort_values(by='Importances [%]', ascending=False)

result.to_excel("Feature_Results.xlsx") 

i = 0
somma = 0
feature = []
while somma <=0.9:
    a = result.iloc[i,-1]
    somma = float(a) + somma
    feature.append(result.iloc[i,-2])
    i = i + 1
Gabriele Valvo
  • 4
    Replace the `,` with a `.`? – Adam Feb 09 '20 at 20:27
  • 1
    https://stackoverflow.com/q/7106417/1324033 – Sayse Feb 09 '20 at 20:32
  • 1
    I've only skimmed the code, but it seems like `str(i).replace(".", ",") for i in importances` is the root of the problem. Why are you doing that? – wjandrea Feb 09 '20 at 20:38
  • 1
    Yes, that is the problem. I did it because I need to export this DataFrame to Excel, so I prefer to have "," as the decimal separator. – Gabriele Valvo Feb 09 '20 at 20:40
  • Right, Excel is locale-specific. I see that now, `result.to_excel()` – wjandrea Feb 09 '20 at 20:41
  • @wjandrea I simply deleted the two lines where I use .replace and it works. But how can I change the dot to a comma without rebuilding the DataFrame after the while loop? – Gabriele Valvo Feb 09 '20 at 20:55
  • Is there a smarter way to do what I do in the while loop? – Gabriele Valvo Feb 09 '20 at 21:10
  • 2
    As an aside, why do you assign column names after creating the DataFrames? It doesn't make much sense to me to do `df1 = pd.DataFrame(features.columns); df1.columns = ['Features']` instead of `df1 = pd.DataFrame(features.columns, columns=['Features'])`. The same goes for the two separate list comprehensions for `importances`; that should be trivial to change. – AMC Feb 09 '20 at 21:31
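On the two follow-up questions in the comments (avoiding the string conversion and replacing the while loop), one possible approach is to keep the importances as floats, pick the features with a cumulative sum, and only introduce the comma when writing text output. The sketch below is not from the thread: the feature names and importance values are hypothetical stand-ins for `features.columns` and `sel.feature_importances_`, and it uses `DataFrame.to_csv(decimal=',')` for the text export. Note that it stops as soon as the running total reaches the threshold, whereas the original loop continues while the total is still exactly 0.9.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for features.columns and sel.feature_importances_
feature_names = ["f1", "f2", "f3", "f4", "f5"]
importances = np.array([0.05, 0.40, 0.10, 0.30, 0.15])

# Keep the importances numeric so sorting and summing behave as expected
result = (
    pd.DataFrame({"Features": feature_names, "Importances [%]": importances})
    .sort_values(by="Importances [%]", ascending=False)
    .reset_index(drop=True)
)

# Running total of the sorted importances
cumulative = result["Importances [%]"].cumsum()

# Number of top features needed for the running total to reach the threshold
threshold = 0.9
n_selected = int(np.searchsorted(cumulative.to_numpy(), threshold)) + 1
n_selected = min(n_selected, len(result))
selected = result["Features"].iloc[:n_selected].tolist()

# The comma only appears in the exported text, never in the DataFrame itself
csv_text = result.to_csv(sep=";", decimal=",", index=False)
```

If the file is written with `result.to_excel(...)` instead, the cells stay numeric, and Excel would normally display them with the decimal separator of the user's own locale.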

3 Answers

3
float('0,25691372'.replace(",", "."))
schlodinger
2

Replace the comma with a dot first, e.g. turn "0,0001" into "0.0001", and then convert the string to a float.
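If the comma-formatted strings already sit in a pandas column, the same replace-then-convert idea can be applied to the whole column at once. This is only a sketch: the Series below is a hypothetical example, not the asker's actual data.

```python
import pandas as pd

# Hypothetical column of strings that use "," as the decimal separator
importances_text = pd.Series(["0,25691372", "0,5", "0,0001"])

# Swap the comma for a dot (plain string replacement, no regex), then parse
importances_num = pd.to_numeric(
    importances_text.str.replace(",", ".", regex=False)
)

# The result is a float column, so arithmetic works again
total = importances_num.sum()
```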

notarealgreal
2

You could use locale.atof() to handle `,` being used as the decimal separator.

import locale
locale.setlocale(locale.LC_ALL, 'fr_FR')
...
    somma = locale.atof(a) + somma
rdas
  • 1
    The reason for using atof is that the decimal separator depends on the locale. The printed value in the question comes from a locale that uses `,` instead of `.`, and Python can't parse that directly with `float`, which expects `.`. – MatsLindh Feb 09 '20 at 20:34
  • I tried your solution but it doesn't work. The error is Error: unsupported locale setting. How can I fix it? – Gabriele Valvo Feb 09 '20 at 20:58
  • You probably don't have the locale installed on your machine. You can check with `locale -a` and install it: https://stackoverflow.com/questions/14547631/python-locale-error-unsupported-locale-setting/37112094 – rdas Feb 09 '20 at 21:02
  • @GabrieleValvo fr_FR is just one possible choice. Use whatever locale you're using, whether that's it_IT (Italian), de_DE (German), de_CH (Swiss German)... – wjandrea Feb 09 '20 at 21:02
  • Yes, but I don't know how to install it. Do I have to use the command prompt? – Gabriele Valvo Feb 09 '20 at 21:08
  • It depends on your OS. – rdas Feb 09 '20 at 21:10
  • I'm using Zorin – Gabriele Valvo Feb 09 '20 at 21:17
  • Zorin is based on Ubuntu. Check out this: https://askubuntu.com/questions/76013/how-do-i-add-locale-to-ubuntu-server – rdas Feb 09 '20 at 21:22
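Since the "unsupported locale setting" error depends on which locales are installed, one defensive option is to try a few candidates and fall back to plain string replacement when none is available. This is a sketch under assumptions: the helper name `parse_decimal_comma` and the candidate locale names are illustrative, and the names that actually work vary by system (check `locale -a` on Linux).

```python
import locale


def parse_decimal_comma(text, candidates=("it_IT.UTF-8", "fr_FR.UTF-8", "de_DE.UTF-8")):
    """Parse a string such as '0,25691372' into a float.

    Hypothetical helper: tries each candidate locale with locale.atof and
    falls back to str.replace if none of them is installed.
    """
    # Calling setlocale without a second argument only queries the current setting
    previous = locale.setlocale(locale.LC_NUMERIC)
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_NUMERIC, name)
            except locale.Error:
                continue  # this locale is not installed, try the next one
            return locale.atof(text)
        # No candidate locale was available: swap the separator manually
        return float(text.replace(",", "."))
    finally:
        # Restore whatever numeric locale was active before the call
        locale.setlocale(locale.LC_NUMERIC, previous)


value = parse_decimal_comma("0,25691372")
```

Restoring the previous setting matters because `setlocale` changes process-wide state, which can affect other code that formats or parses numbers.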