So I basically have a huge dataset to work with: it is made up of almost 1,200,000 rows, and my target has about 20,000 distinct labels.
I am performing text classification on this data, so I first cleaned it and then applied TF-IDF vectorization to it.
The problem is that whenever I pick a model and try to fit the data, I get a MemoryError.
My current PC is a Core i7 with 16 GB of RAM.
vectorizer = feature_extraction.text.TfidfVectorizer(ngram_range=(1, 1),
                                                     analyzer='word',
                                                     stop_words=fr_stopwords)
datavec = vectorizer.fit_transform(data.values.astype('U'))
X_train, X_test, y_train, y_test = train_test_split(datavec, target, test_size=0.2, random_state=0)
print(type(X_train))
print(X_train.shape)
Output:
<class 'scipy.sparse.csr.csr_matrix'>
(963993, 125441)
clf.fit(X_train, y_train)
This is where the MemoryError happens.
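As a rough sanity check (assuming float64 values, 8 bytes each), a dense copy of X_train would be far larger than my 16 GB of RAM:

rows, cols = 963993, 125441        # shape of X_train printed above
dense_bytes = rows * cols * 8      # float64 = 8 bytes per value
print(dense_bytes / 1024**3)       # roughly 900 GiB if the matrix were dense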
I have tried:
1 - taking a sample of the data (roughly as in the sketch right below this list), but the error persists.
2 - fitting many different models, but only the KNN model worked (and with a low accuracy score).
3 - converting datavec to a dense array, but this also causes a MemoryError.
4 - using multiprocessing on the different models.
5 - going through every similar question on SO, but the answers were either unclear or did not relate to my problem exactly.
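The sampling in attempt 1 looked roughly like this (simplified; it reuses the df, vectorizer and clf names from the code below, and the exact fraction varied):

sample = df.sample(frac=0.1, random_state=0)      # roughly 120,000 of the 1.2M rows
datavec_small = vectorizer.fit_transform(sample['NEWSERVICEITEMNAME'].values.astype('U'))
target_small = sample['BENEFITITEMCODEID'].str[1:]
clf[2].fit(datavec_small, target_small)           # LogisticRegression from the clf list below; still a MemoryError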
This is part of my code:
vectorizer = feature_extraction.text.TfidfVectorizer(ngram_range=(1, 1),
                                                     analyzer='word',
                                                     stop_words=fr_stopwords)
df = pd.read_csv("C:\\Users\\user\\Desktop\\CLEAN_ALL_DATA.csv", encoding='latin-1')
classes = np.unique(df['BENEFITITEMCODEID'].str[1:])
vec = vectorizer.fit(df['NEWSERVICEITEMNAME'].values.astype('U'))
del df
clf = [KNeighborsClassifier(n_neighbors=5),
       MultinomialNB(),
       LogisticRegression(solver='lbfgs', multi_class='multinomial'),
       SGDClassifier(loss="log", n_jobs=-1),
       DecisionTreeClassifier(max_depth=5),
       RandomForestClassifier(n_jobs=-1),
       LinearDiscriminantAnalysis(),
       LinearSVC(multi_class='crammer_singer'),
       NearestCentroid(),
       ]
data = pd.Series([])
for chunk in pd.read_csv(datafile, chunksize=100000):
    data = chunk['NEWSERVICEITEMNAME']
    target = chunk['BENEFITITEMCODEID'].str[1:]
    datavec = vectorizer.transform(data.values.astype('U'))
    clf[3].partial_fit(datavec, target, classes=classes)
    print("**CHUNK DONE**")
s = "this is a testing sentence"
svec = vectorizer.transform([s])
clf[3].predict(svec)             --> MemoryError
clf[3].predict(svec).todense()   --> takes a lot of time to finish
clf[3].predict(svec).toarray()   --> also takes a lot of time to finish
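For completeness, this is roughly how I had fitted the different models in attempt 2 (simplified; it uses the clf list and the X_train/X_test split shown above):

for model in clf:
    try:
        model.fit(X_train, y_train)
        print(type(model).__name__, "accuracy:", model.score(X_test, y_test))
    except MemoryError:
        print(type(model).__name__, "--> MemoryError")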
Anything else I could try?