I'm using a "ColumnTransformer" even though I'm transforming only one feature, because I don't know how else to transform only the "clean_text" feature. I'm not using "make_column_transformer" with "make_column_selector" because I would like to run a grid search later. What I don't understand is why column 0 of the dataset can't be found.

import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.model_selection import train_test_split

#dataset download: https://www.kaggle.com/saurabhshahane/twitter-sentiment-dataset 

df = pd.read_csv('Twitter_Data.csv')
y = df['category']   #target
X = df['clean_text'].values.astype('U') #feature; cast to unicode strings even though it should already be text, because otherwise it raised an error

transformers = [
    ['text_vectorizer', CountVectorizer(), [0]];
]

ct = ColumnTransformer(transformers, remainder='passthrough')

ct.fit(X) #<---IndexError: tuple index out of range
X = ct.transform(X)
  • Please always provide the full error traceback. – Ben Reiniger Mar 01 '22 at 15:06
  • I'm not sure if this could be the culprit, but usually the transformers should be given as a tuple, not a list: `('text_vectorizer', CountVectorizer(), [0])` (and I don't know what the semicolon was doing there). – Ben Reiniger Mar 01 '22 at 15:07

1 Answer

Imo there are a couple of points to be highlighted on this example:

  • CountVectorizer requires its input to be 1D. In such cases, documentation for ColumnTransformer states that

columns: str, array-like of str, int, array-like of int, array-like of bool, slice or callable

Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer.

Therefore, the columns parameter should be passed as an int rather than as a list of int. For another reference, I would also suggest the related question "Sklearn custom transformers with pipeline: all the input array dimensions for the concatenation axis must match exactly".

  • Given that you're using a column transformer, I would pass the whole dataframe to method .fit() called on the ColumnTransformer instance, rather than X only.

  • The dataframe seems to have missing values; it might be convenient to process them somehow. For instance, by dropping them and applying what is described above I was able to make it work, but you can also decide to proceed differently.

     import pandas as pd
     from sklearn.compose import ColumnTransformer
     from sklearn.feature_extraction.text import CountVectorizer
    
     #dataset download: https://www.kaggle.com/saurabhshahane/twitter-sentiment-dataset 
     df = pd.read_csv('Twitter_Data.csv')
    
     df.info()  # the non-null counts reveal the missing values
    
     # drop rows with missing values so CountVectorizer only receives strings
     df_n = df.dropna()
    
     transformers = [
         # scalar column index 0 -> the vectorizer receives a 1D column
         ('text_vectorizer', CountVectorizer(), 0)
     ]
    
     ct = ColumnTransformer(transformers, remainder='passthrough')
    
     ct.fit(df_n) 
     X_t = ct.transform(df_n)  # returns a new object; df_n itself is unchanged
    
  • As pointed out in the comments, transformers should be specified as a list of tuples (as per the documentation) rather than as a list of lists. That said, running the snippet above with your transformers specification seems to work, and I have observed that substituting lists for tuples in other, unrelated pieces of code I have does not seem to raise issues either. In my experience, though, it is far more common to find them passed as a list of tuples.
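The scalar-versus-list point above can be checked on a tiny hand-made frame. This is only a sketch: the `toy` data and the variable names are invented for illustration and are not taken from the Kaggle file.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for Twitter_Data.csv
toy = pd.DataFrame({
    "clean_text": ["good day", "bad day", "good good movie"],
    "category": [1, -1, 1],
})

# A one-column DataFrame (what a list selector such as [0] hands over)
# is iterated by column label, so the vectorizer sees the label, not the rows.
vec_2d = CountVectorizer().fit(toy[["clean_text"]])
print(vec_2d.vocabulary_)

# A Series (what a scalar selector such as 0 hands over) is iterated row by row.
vec_1d = CountVectorizer().fit(toy["clean_text"])
print(sorted(vec_1d.vocabulary_))

# Scalar selector inside the ColumnTransformer; 'category' is passed through.
ct = ColumnTransformer(
    [("text_vectorizer", CountVectorizer(), 0)],
    remainder="passthrough",
)
out = ct.fit_transform(toy)
print(out.shape)  # one column per vocabulary term plus the passthrough column
```

With the list selector, the vectorizer learns a vocabulary from the column label only, so its output has a different number of rows than the passthrough column, which would explain the concatenation error mentioned in the related question.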

amiola
  • Your solution is working but if i print "df_n" the whole data is exactly the same as before the vectorization – daniele orsucci Mar 04 '22 at 13:49
  • Not sure I get your point; you should look at the output of `ct.transform(df_n)` to see how data has been transformed after the application of `CountVectorizer()` on the first column of the original dataset. – amiola Mar 04 '22 at 14:14
  • With "ct.transform (df_n)" it gives me the statistic of the dataset, but if i print the dataset shouldn't it return the 'clean_text' column as numbers? to train the model with? – daniele orsucci Mar 04 '22 at 15:22
  • Actually, it does. Perhaps the misunderstanding comes from the fact that it might return a sparse matrix. To get the dataset back as a numpy array, try typing `ct.transform(df_n).toarray()` (and then reassign it back to df_n if needed) – amiola Mar 04 '22 at 16:10
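As a rough illustration of the last two comments: `transform` returns a new object and leaves the input frame untouched, and the result may be sparse or dense depending on `sparse_threshold`. The sketch below uses invented toy data and hypothetical variable names (`counts`, `counts_df`); it checks for sparsity before calling `.toarray()` rather than assuming it.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the cleaned df_n
toy = pd.DataFrame({
    "clean_text": ["good day", "bad day", "good good movie"],
    "category": [1, -1, 1],
})
before = toy.copy()

ct = ColumnTransformer(
    [("text_vectorizer", CountVectorizer(), "clean_text")],
    remainder="drop",        # keep only the word counts in this sketch
    sparse_threshold=1.0,    # assumption: chosen so this small example stays sparse
)
counts = ct.fit_transform(toy)

# The input frame is not modified; the counts live in the returned object.
assert toy.equals(before)

# Densify only if the result is actually sparse.
dense = counts.toarray() if sparse.issparse(counts) else counts

# Label the columns with the learned vocabulary (stored in index order when sorted).
vocab = sorted(ct.named_transformers_["text_vectorizer"].vocabulary_)
counts_df = pd.DataFrame(dense, columns=vocab, index=toy.index)
print(counts_df)
```

On the full dataset a dense array can be very large, so keeping the sparse matrix and passing it straight to an estimator that accepts sparse input is usually the more practical choice.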