Questions tagged [sklearn-pandas]

Python module providing a bridge between Scikit-Learn’s Machine Learning methods and pandas-style DataFrames

1336 questions
98 votes, 6 answers

How to one-hot-encode from a pandas column containing a list?

I would like to break down a pandas column consisting of a list of elements into as many columns as there are unique elements i.e. one-hot-encode them (with value 1 representing a given element existing in a row and 0 in the case of absence). For…
Melsauce • 2,535 • 2 • 19 • 39
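
For the question above, one possible approach is scikit-learn's `MultiLabelBinarizer`, sketched here on made-up data. The frame and the column names `id` and `tags` are assumptions for illustration; they are not taken from the question.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical example data: each row of "tags" holds a list of elements.
df = pd.DataFrame({"id": [1, 2, 3], "tags": [["a", "b"], ["b"], ["a", "c"]]})

mlb = MultiLabelBinarizer()
# One column per unique element; 1 if the element is in the row's list, else 0.
encoded = pd.DataFrame(
    mlb.fit_transform(df["tags"]), columns=mlb.classes_, index=df.index
)
result = df.drop(columns="tags").join(encoded)
```

Passing `index=df.index` keeps the encoded rows aligned with the original frame when its index is not a plain 0..n-1 range.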
42 votes, 4 answers

Sklearn plot_tree plot is too small

I have this simple code: clf = tree.DecisionTreeClassifier() clf = clf.fit(X, y) tree.plot_tree(clf.fit(X, y)) plt.show() And the result I get is this graph: How do I make this graph legible? I'm using PyCharm Professional 2019.3 as my IDE.
Artur • 614 • 1 • 6 • 9
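
A common way to make such a plot legible is to draw it on a larger figure and set the font size explicitly. The sketch below assumes the iris dataset as stand-in data and picks the figure size, `max_depth`, and `fontsize` values arbitrarily.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris

# Stand-in data; the question's own X and y are not shown.
X, y = load_iris(return_X_y=True)
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# A larger canvas plus an explicit font size usually makes the nodes readable.
fig, ax = plt.subplots(figsize=(20, 10), dpi=100)
annotations = tree.plot_tree(clf, ax=ax, fontsize=10, filled=True)
# fig.savefig("tree.png") writes the figure to disk (hypothetical file name).
```

With an interactive backend (remove the `matplotlib.use("Agg")` line), `plt.show()` opens the enlarged figure in a window instead.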
28 votes, 4 answers

sklearn stratified sampling based on a column

I have a fairly large CSV file containing Amazon review data which I read into a pandas data frame. I want to split the data 80-20 (train-test), but while doing so I want to ensure that the split data proportionally represents the values of one…
Azee. • 703 • 1 • 5 • 12
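
A minimal sketch of a stratified 80-20 split using the `stratify` argument of `train_test_split`. The data and the column name `rating` are assumptions; the question's excerpt does not say which column should drive the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical review data: 20 rows with rating 1 and 80 rows with rating 5.
df = pd.DataFrame({
    "review": [f"text {i}" for i in range(100)],
    "rating": [1] * 20 + [5] * 80,
})

# stratify keeps the proportion of each rating the same in both parts.
train, test = train_test_split(
    df, test_size=0.2, stratify=df["rating"], random_state=0
)
```

Here both parts keep the original 20/80 ratio between the two rating values.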
26 votes, 2 answers

python sklearn multiple linear regression display r-squared

I calculated my multiple linear regression equation and I want to see the adjusted R-squared. I know that the score function allows me to see r-squared, but it is not adjusted. import pandas as pd #import the pandas module import numpy as np df =…
jeangelj • 4,338 • 16 • 54 • 98
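
As far as I know, scikit-learn's `score` only reports plain R-squared, so the adjusted value is usually computed by hand as 1 - (1 - R²)(n - 1)/(n - p - 1), with n samples and p predictors. The sketch below uses synthetic data, which is an assumption; the question's own data frame is truncated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 50 samples, 3 predictors, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)  # plain R-squared

n, p = X.shape  # number of samples, number of predictors
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

The adjustment penalizes extra predictors, so `adjusted_r2` is never larger than `r2`.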
23 votes, 3 answers

Using K-means with cosine similarity - Python

I am trying to implement the K-means algorithm in Python using cosine distance instead of Euclidean distance as the distance metric. I understand that using a different distance function can be fatal and should be done carefully. Using cosine distance…
ise372 • 231 • 1 • 2 • 5
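
One workaround sometimes suggested, since scikit-learn's `KMeans` only supports Euclidean distance, is to L2-normalize the rows first. For unit vectors, squared Euclidean distance equals 2(1 - cosine similarity), so the cluster assignments follow cosine similarity. This is an approximation rather than a true cosine K-means, because the centroids themselves are not re-normalized. The data is made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Made-up points: rows 0-1 point roughly along x, rows 2-3 roughly along y,
# with very different magnitudes.
X = np.array([[1.0, 0.1], [10.0, 2.0], [0.1, 1.0], [2.0, 10.0]])

X_unit = normalize(X)  # L2-normalize each row to unit length
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_unit)
```

After normalization the magnitude no longer matters, so the rows group by direction.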
18 votes, 2 answers

Multivariable/Multiple Linear Regression in Scikit Learn?

I have a dataset (dataTrain.csv & dataTest.csv) in .csv files with this format: Temperature(K),Pressure(ATM),CompressibilityFactor(Z) 273.1,24.675,0.806677258 313.1,24.675,0.888394713 ...,...,... And I am able to build a regression model and prediction…
Drizzer Silverberg • 193 • 1 • 1 • 7
17 votes, 4 answers

Scikit K-means clustering performance measure

I'm trying to do clustering with the K-means method, but I would like to measure the performance of my clustering. I'm not an expert but I am eager to learn more about clustering. Here is my code: import pandas as pd from sklearn import…
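
When no ground-truth labels are available, one commonly used measure is the silhouette score. The sketch below runs on synthetic blobs, which is an assumption because the question's data is not shown.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Ranges from -1 to 1; higher means tighter, better-separated clusters.
score = silhouette_score(X, labels)
```

Comparing the score across several values of `n_clusters` is one way to pick the number of clusters.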
17 votes, 6 answers

ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0.0

I have applied Logistic Regression on the train set after splitting the data set into test and train sets, but I got the above error. I tried to work it out, and when I tried to print my response vector y_train in the console it prints integer values…
17 votes, 4 answers

No module named 'pandas' in PyCharm

I read all the topics about this, but I cannot solve my problem: Traceback (most recent call last): File "/home/.../.../.../reading_data.py", line 1, in <module> import pandas as pd ImportError: No module named pandas This is my…
ElenaPhys • 443 • 2 • 5 • 16
16 votes, 2 answers

How to normalize the Train and Test data using MinMaxScaler sklearn

So, I have this doubt and have been looking for answers. So the question is when I use, from sklearn import preprocessing min_max_scaler = preprocessing.MinMaxScaler() df =…
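
The usual pattern is to fit the scaler on the training data only and reuse the fitted scaler on the test data. A minimal sketch with toy arrays, which are an assumption standing in for the question's truncated data frame:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for the train and test features.
X_train = np.array([[1.0], [3.0], [5.0]])
X_test = np.array([[2.0], [6.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min and max from train only
X_test_scaled = scaler.transform(X_test)        # reuses the train min and max
```

Test values outside the training range map outside [0, 1], which is expected; refitting on the test set would leak information about it.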
16 votes, 1 answer

'DataFrame' object has no attribute 'ravel' when transforming target variable?

I was fitting a logistic regression with a subset dataset. After splitting the dataset and fitting the model, I got the following error message: /Users/Eddie/anaconda/lib/python3.4/site-packages/sklearn/utils/validation.py:526:…
Edward Lin • 609 • 1 • 9 • 16
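
A `DataFrame` has no `ravel` method, but its underlying NumPy array does. A small sketch, assuming a one-column target frame; the names `x1`, `x2`, and `target` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical features and a target stored as a one-column DataFrame.
X = pd.DataFrame({"x1": [0.0, 1.0, 2.0, 3.0], "x2": [1.0, 0.0, 1.0, 0.0]})
y_df = pd.DataFrame({"target": [0, 0, 1, 1]})

# Convert to a NumPy array first, then flatten to shape (n_samples,).
y = y_df.to_numpy().ravel()

model = LogisticRegression().fit(X, y)
```

Selecting the column as a Series, for example `y_df["target"]`, also gives the estimator a 1-D target.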
16 votes, 1 answer

use FeatureUnion in scikit-learn to combine two pandas columns for tfidf

While using this as a model for spam classification, I'd like to add an additional feature of the Subject plus the body. I have all of my features in a pandas dataframe. For example, the subject is df['Subject'], the body is df['body_text'] and the…
BLodge • 163 • 1 • 1 • 4
15 votes, 4 answers

What is the difference between X_test, X_train, y_test, y_train in sklearn?

I'm learning sklearn and I don't fully understand the difference between them or why the function train_test_split() returns 4 outputs. I found some examples in the documentation, but they weren't sufficient to resolve my doubts. Does the code use the X_train to…
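
A short sketch of how the four outputs are typically used: the model is fitted on `X_train` and `y_train`, then evaluated on `X_test` and `y_test`, which it has not seen. The toy arrays and the choice of `LogisticRegression` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, binary labels.
X = np.arange(40, dtype=float).reshape(20, 2) / 40
y = np.array([0] * 10 + [1] * 10)

# X_* are features, y_* are labels; train is for fitting, test is held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)  # learn from train only
accuracy = model.score(X_test, y_test)              # evaluate on unseen data
```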
14 votes, 3 answers

Append tfidf to pandas dataframe

I have the following pandas structure: col1 col2 col3 text 1 1 0 meaningful text 5 9 7 trees 7 8 2 text I'd like to vectorise it using a tfidf vectoriser. This, however, returns a sparse matrix, which I can actually turn…
lte__ • 7,175 • 25 • 74 • 131
14 votes, 2 answers

How to load only column names from csv file (Pandas)?

I have a large csv file and don't want to load it fully into memory; I need to get only the column names from this csv file. How can I do this cleanly?
Ivan Shelonik • 1,958 • 5 • 25 • 49
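
Reading with `nrows=0` parses only the header line, so no data rows are loaded. The sketch uses an in-memory string as a stand-in for the file; with a real file, its path would be passed to `read_csv` instead.

```python
import io
import pandas as pd

# In-memory stand-in for a large CSV file.
csv_source = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")

# nrows=0 reads the header and no data rows.
columns = pd.read_csv(csv_source, nrows=0).columns.tolist()
```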