Extracting TF-IDF features as multiple columns with pyspark

Asked Jun 25 '20 at 15:07

Active Jun 25 '20 at 15:14

Viewed 246 times

Usually pyspark.ml.feature.IDF returns a single outputCol that contains a SparseVector. What I need instead is N columns of real-valued numbers, where N is the number of features defined in IDF (so that I can use the dataframe in CatBoost later).

I tried converting the vector column to an array with a UDF:

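from pyspark.sql import functions as F
from pyspark.sql import types as T

def dense_to_array(v):
    # Convert the vector to a plain list of Python floats so it fits an array column.
    return [float(x) for x in v]

dense_to_array_udf = F.udf(dense_to_array, T.ArrayType(T.FloatType()))

data = data.withColumn('tf_idf_features_array', dense_to_array_udf('tf_idf_features'))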

and then used pandas to expand the array into separate columns:

import pandas as pd

data = data.toPandas()
# One column name per feature (32 features here).
cols = [f'tf_idf_{i}' for i in range(32)]
data = pd.DataFrame(data['tf_idf_features_array'].values.tolist(), columns=cols)

I don't like this approach because it is really slow. Is there a way to solve this in pyspark, without pandas?

edited Jun 25 '20 at 15:14

asked Jun 25 '20 at 15:07

Gleb

Does [this answer](https://stackoverflow.com/a/38385033/2129801) help? – werner Jun 25 '20 at 16:30
