
I am trying to do topic modeling (with German stop words and German text), following the explanation in: Albrecht, Jens, Sidharth Ramachandran, and Christian Winkler. Blueprints for Text Analytics Using Python: Machine Learning-Based Solutions for Common Real World (NLP) Applications. First edition. Sebastopol, CA: O’Reilly Media, Inc, 2020, page 209 ff.

# Load Data
import pandas as pd
# Load the Excel file via read_excel
xlsx = pd.ExcelFile("Priorisierung_der_Anforderungen.xlsx")
df = pd.read_excel(xlsx)

# Convert the Anforderungsbeschreibung column to string
df=df.astype({'Anforderungsbeschreibung':'string'})
df.info()

# "Ignore spaces after the stop..."
import re
df["paragraphs"] = df["Anforderungsbeschreibung"].map(lambda text: re.split(r'\.\s*\n', text))
df["number_of_paragraphs"] = df["paragraphs"].map(len)

%matplotlib inline
df.groupby('Title').agg({'number_of_paragraphs': 'mean'}).plot.bar(figsize=(24,12))


# Preparations
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.de.stop_words import STOP_WORDS as stopwords

tfidf_text_vectorizer = TfidfVectorizer(stop_words=stopwords, min_df=5, max_df=0.7)
tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(df['Anforderungsbeschreibung'])
tfidf_text_vectors.shape

I receive this error message:

 InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None.   

InvalidParameterError                     Traceback (most recent call last)
Cell In[8], line 4
  1 #tfidf_text_vectorizer = = TfidfVectorizer(stop_words=stopwords.words('german'),)
  3 tfidf_text_vectorizer = TfidfVectorizer(stop_words=stopwords, min_df=5, max_df=0.7)
----> 4 tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(df['Anforderungsbeschreibung'])
  5 tfidf_text_vectors.shape

InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None.

Thank you for any tips. Sebastian

SebastianS

1 Answer


The stop words you imported from spaCy are a set, not a list.

from spacy.lang.de.stop_words import STOP_WORDS

type(STOP_WORDS)

[out]:

set

Convert the stop words to a list and it should work as expected.

from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.de.stop_words import STOP_WORDS


tfidf_text_vectorizer = TfidfVectorizer(stop_words=list(STOP_WORDS))
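
If you want the conversion to be reproducible (a plain `list(...)` of a set has no guaranteed order), a small helper along these lines may be useful. This is only a sketch: `to_stop_word_list` is a hypothetical name, and the sample words stand in for spaCy's German `STOP_WORDS` so that the snippet runs without spaCy installed.

```python
from typing import Iterable, List


def to_stop_word_list(stop_words: Iterable[str]) -> List[str]:
    """Return the stop words as a sorted list without duplicates."""
    # set() removes duplicates, sorted() returns a list in a stable order
    return sorted(set(stop_words))


# Hypothetical sample standing in for spacy.lang.de.stop_words.STOP_WORDS
german_stop_words = frozenset({"und", "der", "die"})

stop_word_list = to_stop_word_list(german_stop_words)
print(type(stop_word_list).__name__, stop_word_list)
```

The result is a plain `list`, so it can be passed as `TfidfVectorizer(stop_words=to_stop_word_list(STOP_WORDS))`, which is the type the error message asks for.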
alvas
    Is this a change implemented in sklearn recently? In my own experience, passing a set used to be fine. Older answers on Stack Overflow also show that it used to accept a frozenset, for example https://stackoverflow.com/questions/24386489/adding-words-to-scikit-learns-countvectorizers-stop-list – 3123 Aug 13 '23 at 11:20