I have this scikit-learn code from a tutorial:

pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

I want to rewrite it as plain step-by-step code, something like this:

X_train = predictors.fit_transform(X_train)
X_train = bow_vector.fit_transform(X_train)
classifier.fit(X_train)

But I keep getting errors, and a quick read of the documentation didn't help.

Update

My exact code is:

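import string

import pandas as pd
import spacy
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from spacy.lang.en import English

df = pd.read_excel('data.xlsx')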

X = df['X'] 
ylabels = df['y']

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3, random_state=42)

List of punctuation marks

punctuations = string.punctuation

Natural Language Processing engine

nlp = spacy.load('en')

List of stop words

stop_words = spacy.lang.en.stop_words.STOP_WORDS

Load English tokenizer, tagger, parser, NER and word vectors

parser = English()

Tokenizer

def spacy_tokenizer(sentence):
    # Tokenize the sentence into a spaCy Doc
    mytokens = parser(sentence)

    # Lemmatizing each token and converting each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens

First element of pipeline

class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

Basic function to clean the text

def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()
sokolov0
  • Well, what are the errors? And what is your exact code? I would need to know the data types of your variables – Ralf Sep 05 '19 at 14:21
  • You should be able to call all these methods on `pipe`, e.g. `X_train = pipe.fit_transform(X_train)`. There should be no need to "disassemble" the pipeline if all you want to do is use it. [sklearn.pipeline.Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) docs – Lomtrur Sep 05 '19 at 14:22
  • Related: [Python - What is exactly sklearn.pipeline.Pipeline?](https://stackoverflow.com/questions/33091376/python-what-is-exactly-sklearn-pipeline-pipeline) – Lomtrur Sep 05 '19 at 14:25
  • I understand that I can do all the operations in the pipeline at once, but I want to dig deeper and understand what happens to X_train step by step. To do this, I want to run the first operation, look at X_train, then run the second operation, and so on – sokolov0 Sep 05 '19 at 14:25
  • I want to post my code, but stack overflow says: "It looks like your post is mostly code; please add some more details." – sokolov0 Sep 05 '19 at 14:30
  • Mostly the code is from the tutorial – sokolov0 Sep 05 '19 at 14:32
  • Split your code into logical blocks and add explanations before each block. That way the ratio between code and text should balance out and you should be able to post it. – Lomtrur Sep 05 '19 at 14:32
  • @sokolov0 the goal is to post a minimal reproducible example, not the whole code – Ralf Sep 05 '19 at 14:39
  • @sokolov0 also, what is the full error you are getting? – Ralf Sep 05 '19 at 14:39
  • @Lomtrur I added code – sokolov0 Sep 06 '19 at 07:11
  • @Ralf is it okay, or should I post shorter code? – sokolov0 Sep 06 '19 at 07:11
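
The step-by-step inspection asked for in the comments can also be done without taking the pipeline apart. The following is only a sketch under assumptions: the toy texts and labels are made up, `clean_all` is a hypothetical stand-in for the tutorial's cleaning step, and `CountVectorizer` with default settings replaces the spaCy-based vectorizer. It fits the pipeline once and then looks at the output of each fitted step through `named_steps`, or through slicing (`pipe[:-1]`), which should be available in scikit-learn 0.21 and later.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def clean_all(texts):
    """Hypothetical stand-in for the cleaning step: strip and lowercase."""
    return [t.strip().lower() for t in texts]


# Toy data, made up for illustration only.
X = ["  Good film ", "BAD film", "good story", " bad story "]
y = [1, 0, 1, 0]

pipe = Pipeline([
    ("cleaner", FunctionTransformer(clean_all)),
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(solver="lbfgs")),
])
pipe.fit(X, y)

# Apply the fitted steps one at a time and look at each result.
cleaned = pipe.named_steps["cleaner"].transform(X)
vectors = pipe.named_steps["vectorizer"].transform(cleaned)

# Slicing gives a sub-pipeline with every step except the classifier.
vectors_via_slice = pipe[:-1].transform(X)

print(cleaned)
print(vectors.shape)
```

Here `cleaned` is a plain list of strings and `vectors` is a sparse document-term matrix, so each intermediate result can be printed or explored before moving on to the next step.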

1 Answer

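I solved my problem. The cleaner has to be instantiated (`predictors()`, not `predictors`), and the classifier has to be fitted with the labels `y_train` as well. For the test data, only `transform` is called on the already fitted vectorizer: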

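from sklearn.linear_model import LogisticRegression

# Step 1: clean the raw training text (the cleaner is stateless, so transform() is enough)
cleaner = predictors()
X_train_cleaned = cleaner.transform(X_train)

# Step 2: learn the vocabulary and IDF weights from the training text and vectorize it
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)
X_train_transformed = tfidf_vector.fit_transform(X_train_cleaned)

# Step 3: fit the classifier on the vectors and the labels
classifier = LogisticRegression(solver='lbfgs')
classifier.fit(X_train_transformed, y_train)

# Test data: reuse the fitted objects and call transform() only
X_test_cleaned = cleaner.transform(X_test)
X_test_transformed = tfidf_vector.transform(X_test_cleaned)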
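
As a rough check that the manual steps do the same work as the pipeline, the two can be run side by side. This is a minimal sketch, not the tutorial's code: the `Cleaner` class is a simplified stand-in for `predictors`, the toy texts and labels are made up, and `TfidfVectorizer` uses its default tokenizer instead of `spacy_tokenizer` so the example runs without spaCy.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class Cleaner(BaseEstimator, TransformerMixin):
    """Simplified stand-in for the `predictors` class: strip and lowercase."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.strip().lower() for text in X]


# Toy data, made up for illustration only.
X_train = ["  Great movie ", "Terrible plot", "great acting", " terrible pacing "]
y_train = [1, 0, 1, 0]
X_test = [" GREAT plot ", "terrible acting"]

# Manual version: every intermediate result stays available for inspection.
cleaner = Cleaner()
vectorizer = TfidfVectorizer()
classifier = LogisticRegression(solver="lbfgs")

X_train_cleaned = cleaner.fit_transform(X_train)             # list of str
X_train_vectors = vectorizer.fit_transform(X_train_cleaned)  # sparse matrix
classifier.fit(X_train_vectors, y_train)

# On test data only transform() is called, never fit().
X_test_vectors = vectorizer.transform(cleaner.transform(X_test))
manual_pred = classifier.predict(X_test_vectors)

# Pipeline version with fresh, identically configured steps.
pipe = Pipeline([
    ("cleaner", Cleaner()),
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression(solver="lbfgs")),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

print(manual_pred, pipe_pred)
```

Both versions should print the same predictions, since `pipe.fit` calls `fit_transform` on each transformer in order and `fit` on the final estimator, and `pipe.predict` calls `transform` on each transformer before `predict`.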
sokolov0