Python module providing a bridge between Scikit-Learn’s Machine Learning methods and pandas-style DataFrames
Questions tagged [sklearn-pandas]
1336 questions
98
votes
6 answers
How to one-hot-encode from a pandas column containing a list?
I would like to break down a pandas column consisting of a list of elements into as many columns as there are unique elements i.e. one-hot-encode them (with value 1 representing a given element existing in a row and 0 in the case of absence).
For…

Melsauce
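One possible approach to the list-column question above, sketched with pandas only. The column names `id` and `fruits` and the sample rows are made up for illustration; they are not from the original question.

```python
import pandas as pd

# Hypothetical sample data: each row holds a list of labels.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "fruits": [["apple", "banana"], ["banana"], ["apple", "cherry"]],
})

# explode() gives one row per list element (the index is repeated),
# get_dummies() one-hot-encodes each element, and the groupby collapses
# the repeated index back to one row per original record.
dummies = (
    pd.get_dummies(df["fruits"].explode())
    .groupby(level=0)
    .max()
    .astype(int)
)

encoded = df.drop(columns="fruits").join(dummies)
print(encoded)
```

`sklearn.preprocessing.MultiLabelBinarizer` is another common option when the encoding has to be reused on new data.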
42
votes
4 answers
Sklearn plot_tree plot is too small
I have this simple code:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
tree.plot_tree(clf.fit(X, y))
plt.show()
And the result I get is this graph:
How do I make this graph legible? I'm using PyCharm Professional 2019.3 as my IDE.

Artur
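A minimal sketch of one way to get a larger plot for the question above: create the figure with an explicit size and pass its axes to `plot_tree`. The iris data, `max_depth=3`, and the chosen figure size are assumptions for illustration only.

```python
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris

# Stand-in data; the question's own X and y are not shown.
X, y = load_iris(return_X_y=True)
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# A bigger canvas plus an explicit font size usually makes the nodes legible.
fig, ax = plt.subplots(figsize=(20, 10))
annotations = tree.plot_tree(clf, ax=ax, fontsize=10, filled=True)
```

After this, `plt.show()` displays the figure as in the question, and `fig.savefig(...)` with a higher `dpi` is an alternative when the IDE preview stays small.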
28
votes
4 answers
sklearn stratified sampling based on a column
I have a fairly large CSV file containing Amazon review data, which I read into a pandas DataFrame. I want to split the data 80-20 (train-test), but while doing so I want to ensure that the split data proportionally represents the values of one…

Azee.
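For the stratified 80-20 split described above, `train_test_split` accepts a `stratify` argument. The sketch below uses a made-up DataFrame; the column names `review` and `rating` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows of one class, 80 of another.
df = pd.DataFrame({
    "review": [f"text {i}" for i in range(100)],
    "rating": [1] * 20 + [5] * 80,
})

# stratify= keeps the class proportions of the chosen column in both parts.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["rating"], random_state=42
)

print(train_df["rating"].value_counts(normalize=True))
print(test_df["rating"].value_counts(normalize=True))
```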
26
votes
2 answers
python sklearn multiple linear regression display r-squared
I calculated my multiple linear regression equation and I want to see the adjusted R-squared. I know that the score function allows me to see r-squared, but it is not adjusted.
import pandas as pd #import the pandas module
import numpy as np
df =…

jeangelj
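As far as I know, scikit-learn's `score` only reports plain R-squared, so the adjusted value is usually computed by hand with 1 - (1 - R²)(n - 1)/(n - p - 1). A sketch on synthetic data follows; the helper name `adjusted_r2` is my own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)
adj = adjusted_r2(r2, *X.shape)
print(r2, adj)
```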
23
votes
3 answers
Using K-means with cosine similarity - Python
I am trying to implement the K-means algorithm in Python, using cosine distance instead of Euclidean distance as the distance metric.
I understand that using a different distance function can be fatal and should be done carefully. Using cosine distance…

ise372
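scikit-learn's `KMeans` does not take a custom metric, so a common workaround is to L2-normalize the rows first: on unit vectors, squared Euclidean distance equals 2(1 - cosine similarity). This is only an approximation of spherical k-means (the centroids are not renormalized), and the data below is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Synthetic points in two clearly different directions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[5, 0], scale=0.3, size=(20, 2)),
    rng.normal(loc=[0, 5], scale=0.3, size=(20, 2)),
])

# Project every row onto the unit sphere (L2 norm is the default).
X_unit = normalize(X)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_unit)
labels = km.labels_
print(labels)
```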
18
votes
2 answers
Multivariable/Multiple Linear Regression in Scikit Learn?
I have a dataset (dataTrain.csv & dataTest.csv) in .csv file with this format:
Temperature(K),Pressure(ATM),CompressibilityFactor(Z)
273.1,24.675,0.806677258
313.1,24.675,0.888394713
...,...,...
And able to build a regression model and prediction…

Drizzer Silverberg
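A minimal sketch of a two-feature regression using the column names from the question above. The values are synthetic, generated from a made-up linear rule, so they are not real compressibility data. In practice the frame would come from something like `pd.read_csv("dataTrain.csv")`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["Temperature(K)", "Pressure(ATM)"]
target = "CompressibilityFactor(Z)"

# Synthetic stand-in for dataTrain.csv (values are NOT real measurements).
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "Temperature(K)": rng.uniform(270, 350, 30),
    "Pressure(ATM)": rng.uniform(1, 30, 30),
})
train[target] = 0.5 + 0.002 * train["Temperature(K)"] - 0.004 * train["Pressure(ATM)"]

# Passing several columns as X is all that "multiple" regression requires.
model = LinearRegression().fit(train[features], train[target])

new_rows = pd.DataFrame({"Temperature(K)": [273.1], "Pressure(ATM)": [24.675]})
prediction = model.predict(new_rows[features])
print(model.coef_, model.intercept_, prediction)
```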
17
votes
4 answers
Scikit K-means clustering performance measure
I'm trying to do clustering with the K-means method, and I would like to measure the performance of my clustering.
I'm not an expert, but I am eager to learn more about clustering.
Here is my code:
import pandas as pd
from sklearn import…

Viphone Rathikoun
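When no ground-truth labels are available, two commonly used internal measures are the inertia (within-cluster sum of squares) and the silhouette score. A sketch on synthetic blobs, with the data and the choice of three clusters being assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three blobs.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

inertia = km.inertia_                  # lower is tighter, but always drops as k grows
sil = silhouette_score(X, km.labels_)  # in [-1, 1]; higher means better separated
print(inertia, sil)
```

Comparing the silhouette score across several values of `n_clusters` is one way to pick k.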
17
votes
6 answers
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0.0
I applied Logistic Regression to the training set after splitting the data set into test and train sets, but I got the above error. I tried to work it out, and when I tried to print my response vector y_train in the console it prints integer values…

Amey Kumar Samala
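This error generally means `y_train` ended up with a single label. A quick check is to count the classes after the split. With imbalanced or sorted data, a stratified split may help. The data below is synthetic, and whether this is the cause in the question above is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target: 90 zeros followed by 10 ones.
y = np.array([0.0] * 90 + [1.0] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Inspect which classes actually reached the training set.
classes, counts = np.unique(y_train, return_counts=True)
print(dict(zip(classes, counts)))
```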
17
votes
4 answers
No module named 'pandas' in Pycharm
I have read all the related topics, but I still cannot solve my problem:
Traceback (most recent call last):
File "/home/.../.../.../reading_data.py", line 1, in <module>
import pandas as pd
ImportError: No module named pandas
This is my…

ElenaPhys
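A frequent cause of this error is that the IDE runs a different interpreter from the one where pandas was installed. The standard-library sketch below only reports which interpreter is running and whether it can see pandas; it does not fix the project settings.

```python
import importlib.util
import sys

# Path of the interpreter that is executing this script.
interpreter = sys.executable

# find_spec returns None when the module is not importable from here.
pandas_spec = importlib.util.find_spec("pandas")

print("Interpreter:", interpreter)
print("pandas found:", pandas_spec is not None)
```

If pandas is reported missing, installing it with that same interpreter (for example `<interpreter> -m pip install pandas`) or pointing the project at the interpreter that already has it are the usual remedies.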
16
votes
2 answers
How to normalize the Train and Test data using MinMaxScaler sklearn
I have a doubt and have been looking for answers. The question is, when I use
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
df =…

Tia
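The usual pattern is to fit the scaler on the training data only and reuse it for the test data, so that no information from the test set leaks into the scaling. A sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values for illustration.
X_train = np.array([[1.0], [2.0], [3.0], [5.0]])
X_test = np.array([[0.0], [4.0], [6.0]])

# Learn min and max from the training set only.
scaler = MinMaxScaler().fit(X_train)

X_train_scaled = scaler.transform(X_train)
# Test values may fall outside [0, 1]; that is expected.
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.ravel(), X_test_scaled.ravel())
```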
16
votes
1 answer
'DataFrame' object has no attribute 'ravel' when transforming target variable?
I was fitting a logistic regression on a subset of a dataset. After splitting the dataset and fitting the model, I got the following error message:
/Users/Eddie/anaconda/lib/python3.4/site-packages/sklearn/utils/validation.py:526:…

Edward Lin
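`ravel` is a NumPy array method, not a DataFrame one, so the target has to be converted first. A sketch with hypothetical column names and toy values:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data.
X = pd.DataFrame({"x1": [0.1, 0.2, 0.8, 0.9], "x2": [1.0, 0.9, 0.2, 0.1]})
y_df = pd.DataFrame({"label": [0, 0, 1, 1]})

# Selecting the single column gives a 1-D Series; to_numpy() makes it an array.
# y_df.to_numpy().ravel() would give the same 1-D result.
y = y_df["label"].to_numpy()

clf = LogisticRegression().fit(X, y)
print(y.shape)
```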
16
votes
1 answer
Use FeatureUnion in scikit-learn to combine two pandas columns for tfidf
While using this as a model for spam classification, I'd like to add an additional feature of the Subject plus the body.
I have all of my features in a pandas dataframe. For example, the subject is df['Subject'], the body is df['body_text'] and the…

BLodge
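The question asks about `FeatureUnion`. The sketch below swaps in `ColumnTransformer`, which can apply a separate `TfidfVectorizer` to each text column and concatenate the results. The column names `Subject` and `body_text` come from the question; the example rows are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up rows for illustration.
df = pd.DataFrame({
    "Subject": ["win a prize", "meeting notes"],
    "body_text": ["claim your free prize now", "agenda for the weekly meeting"],
})

# A plain string (not a list) as the column selector hands each vectorizer
# a 1-D column of text, which is what TfidfVectorizer expects.
features = ColumnTransformer([
    ("subject", TfidfVectorizer(), "Subject"),
    ("body", TfidfVectorizer(), "body_text"),
])

X = features.fit_transform(df)
print(X.shape)
```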
15
votes
4 answers
What is the difference between X_test, X_train, y_test, y_train in sklearn?
I'm learning sklearn and I don't fully understand the difference between them, or why the function train_test_split() returns 4 outputs.
In the documentation, I found some examples, but they weren't sufficient to resolve my doubts.
Does the code use the X_train to…

Jancer Lima
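A small sketch of what the four outputs are: features and targets, each split into a training part and a test part, with rows kept aligned. The arrays are toy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of X is [2*i, 2*i + 1], and its target is i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# X_train / y_train are used to fit a model.
# X_test / y_test are held back to evaluate it on unseen rows.
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```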
14
votes
3 answers
Append tfidf to pandas dataframe
I have the following pandas structure:
col1 col2 col3 text
1 1 0 meaningful text
5 9 7 trees
7 8 2 text
I'd like to vectorise it using a tfidf vectoriser. This, however, returns a sparse matrix, which I can actually turn…

lte__
14
votes
2 answers
How to load only column names from csv file (Pandas)?
I have a large csv file and don't want to load it fully into memory; I only need the column names from this file. How can I load just those?

Ivan Shelonik
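For reading only the header, `pd.read_csv` with `nrows=0` parses the column names and no data rows. The sketch uses an in-memory string in place of the large file.

```python
import io

import pandas as pd

# Stand-in for a large CSV file on disk.
csv_data = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")

# nrows=0 reads the header line and returns an empty frame.
columns = pd.read_csv(csv_data, nrows=0).columns.tolist()
print(columns)
```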