Is this the best way to use pandas with a vectorizer? I convert the dataframe to a dict, vectorize it, and put the result into a new dataframe. Or is there a better way to do this?

import pandas as pd

# Putting AmesHousing.txt data into a dataframe
data = pd.read_csv('AmesHousing.txt', encoding='UTF-8', delimiter='\t')
data = data.fillna(0)


from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)

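# Pass the feature names as a flat sequence; wrapping them in a list creates MultiIndex columns.
# get_feature_names_out() is the scikit-learn >= 1.0 name; older versions use get_feature_names().
df = pd.DataFrame(vec.fit_transform(data.to_dict(orient='records')), columns=vec.get_feature_names_out())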


from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Split the data into two pieces: test gets 33% of the rows, train gets the rest.
# train_test_split returns the train split first, then the test split.
train, test = train_test_split(df, test_size=0.33, random_state=42)

model = LinearRegression()
model.fit(train.drop(['SalePrice'], axis=1), train[['SalePrice']])

predict = model.predict(test.drop(['SalePrice'], axis=1))

# mean_squared_error expects (y_true, y_pred)
MSE = mean_squared_error(test[['SalePrice']], predict)
RMSE = np.sqrt(MSE) 
print('MSE:',MSE,'RMSE:',RMSE)
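
For comparison, here is a minimal sketch of one way to skip the dict round trip: one-hot encode only the non-numeric columns with `pd.get_dummies` and leave the numeric ones untouched. The small frame is a hypothetical stand-in for a few AmesHousing columns (the values are made up), and the `"Missing"` label for empty categories is an assumption, not something from the code above.

```python
import pandas as pd

# Hypothetical stand-in for a few AmesHousing columns (values are made up).
data = pd.DataFrame({
    "Lot Area": [8450, 9600, 11250, 9550],
    "Neighborhood": ["NAmes", "Gilbert", "NAmes", None],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Pick the non-numeric (categorical) columns; numeric columns are left alone.
cat_cols = list(data.select_dtypes(exclude="number").columns)

# Label missing categories explicitly instead of filling them with 0
# ("Missing" is an assumed placeholder).
data[cat_cols] = data[cat_cols].fillna("Missing")

# One-hot encode only the categorical columns; the result is already a DataFrame.
encoded = pd.get_dummies(data, columns=cat_cols, dtype=int)

print(encoded.head())
```

The result can be passed to `train_test_split` in the same way as `df` above.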
Ente
Marcel Pinheiro
  • Please share with us what the properties of a possible better solution would be. More readable? Faster? More maintainable? ... – Ente Dec 03 '19 at 21:10
  • It also depends on what input data you are trying to vectorize. If you already have a dataframe, then unless its values are `dict` type, there's probably a better vectorizer to use for your task. Have a look at [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and provide a minimal, reproducible example. – G. Anderson Dec 03 '19 at 21:16
  • Worth noting: If you're using the [ames housing dataset](https://www.kaggle.com/prevek18/ames-housing-dataset), then this is 100% (well, 99%) not the correct way to do feature extraction. If I'm reading this correctly, you're basically one-hot/categorically encoding every single feature regardless of data type or value. Looking at the outputs, how is your vectorized DF different than the original DF? – G. Anderson Dec 03 '19 at 21:35
  • I tried the pandas corr() function to find out which features were most correlated with my target, and used those features. Later I changed my code to this version. Is this the right way to work? I'm also looking for an entropy-based solution. – Marcel Pinheiro Dec 03 '19 at 23:46
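
On the correlation and entropy point in the last comment, a rough sketch of how the two rankings could be compared: absolute Pearson correlation via `DataFrame.corrwith`, next to an entropy-based score from scikit-learn's `mutual_info_regression`. The data is synthetic and the column names (`linear`, `nonlinear`, `noise`) are made up for illustration; this is not the Ames dataset and not a recommendation of one method over the other.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic stand-in data (hypothetical, not AmesHousing).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "linear": rng.normal(size=n),            # linearly related to the target
    "nonlinear": rng.uniform(-3, 3, size=n),  # related through a square
    "noise": rng.normal(size=n),              # unrelated to the target
})
y = 2.0 * X["linear"] + X["nonlinear"] ** 2 + rng.normal(scale=0.1, size=n)

# Ranking by absolute Pearson correlation with the target.
corr_rank = X.corrwith(y).abs().sort_values(ascending=False)

# Ranking by estimated mutual information (an entropy-based dependence measure).
mi = mutual_info_regression(X, y, random_state=0)
mi_rank = pd.Series(mi, index=X.columns).sort_values(ascending=False)

print(corr_rank)
print(mi_rank)
```

Correlation only measures linear association, so a feature like `nonlinear` here can score near zero on `corr()` while still carrying information that the mutual information estimate picks up.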

0 Answers