
So I have a huge dataset to work with: it is made up of almost 1,200,000 rows, and my target has about 20,000 distinct labels.

I am performing text classification on this data, so I first cleaned it and then performed TF-IDF vectorization on it.

The problem is that whenever I pick a model and try to fit the data, I get a MemoryError.

My current PC is a Core i7 with 16 GB of RAM.

vectorizer = feature_extraction.text.TfidfVectorizer(ngram_range=(1, 1),
                                                     analyzer='word',
                                                     stop_words=fr_stopwords)

datavec = vectorizer.fit_transform(data.values.astype('U'))

X_train, X_test, y_train, y_test = train_test_split(datavec, target, test_size=0.2, random_state=0)


print(type(X_train))
print(X_train.shape)

Output:

<class 'scipy.sparse.csr.csr_matrix'>
(963993, 125441)

clf.fit(X_train, y_train)

This is where the MemoryError happens.

I have tried:

1 - Taking a sample of the data, but the error persists.

2 - Fitting many different models, but only the KNN model worked (and with a low accuracy score).

3 - Converting datavec to a dense array, but this also causes a MemoryError (see the size estimate after this list).

4 - Using multiprocessing on different models.

5 - Going through every similar question on SO, but the answers were either unclear or did not relate to my problem exactly.
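
A quick back-of-the-envelope estimate (assuming the default float64 dtype) of why densifying a matrix of that shape cannot possibly fit in RAM:

rows, cols = 963993, 125441        # shape of X_train printed above
dense_bytes = rows * cols * 8      # float64 = 8 bytes per entry
print(dense_bytes / 1024**3)       # roughly 900 GiB -- far beyond 16 GB of RAM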

This is a part of my code:

import numpy as np
import pandas as pd
from sklearn import feature_extraction
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC

# fr_stopwords is my list of French stop words, defined elsewhere
vectorizer = feature_extraction.text.TfidfVectorizer(ngram_range=(1, 1),
                                                     analyzer='word',
                                                     stop_words=fr_stopwords)

df = pd.read_csv("C:\\Users\\user\\Desktop\\CLEAN_ALL_DATA.csv", encoding='latin-1')
classes = np.unique(df['BENEFITITEMCODEID'].str[1:])

vec = vectorizer.fit(df['NEWSERVICEITEMNAME'].values.astype('U'))

del df

clf = [KNeighborsClassifier(n_neighbors=5),
       MultinomialNB(),
       LogisticRegression(solver='lbfgs', multi_class='multinomial'),
       SGDClassifier(loss="log", n_jobs=-1),
       DecisionTreeClassifier(max_depth=5),
       RandomForestClassifier(n_jobs=-1),
       LinearDiscriminantAnalysis(),
       LinearSVC(multi_class='crammer_singer'),
       NearestCentroid(),
       ]

# datafile points to the same CSV as above
for chunk in pd.read_csv(datafile, chunksize=100000):

    data = chunk['NEWSERVICEITEMNAME']
    target = chunk['BENEFITITEMCODEID'].str[1:]

    datavec = vectorizer.transform(data.values.astype('U'))

    clf[3].partial_fit(datavec, target, classes=classes)
    print("**CHUNK DONE**")

s = "this is a testing sentence"
svec = vectorizer.transform([s])

clf[3].predict(svec)             # --> memory error
clf[3].predict(svec).todense()   # --> taking a lot of time to finish
clf[3].predict(svec).toarray()   # --> taking a lot of time to finish as well

Anything else I could try?

RalphCh97
  • https://scikit-learn.org/0.15/modules/scaling_strategies.html – Corentin Limier Jul 10 '19 at 11:25
  • `to take a sample of the data, but the error is persisting.` -> I doubt that the error persists whatever the size of the sample. Try some different sizes and you may get an idea of the number of rows you can handle in RAM. – Corentin Limier Jul 10 '19 at 11:28
  • `to use multi processing on different models` -> multiprocessing does not decrease memory usage, au contraire – Corentin Limier Jul 10 '19 at 11:29
  • @CorentinLimier with all due respect, I have taken 10% of the data and it was still happening... I can't keep going lower, I would lose many targets – RalphCh97 Jul 10 '19 at 11:31
  • @CorentinLimier yes, you are right, but I saw it as an answer on one of the SO questions, so I thought I'd mention it – RalphCh97 Jul 10 '19 at 11:32
  • I believe you, I mean I know that 2% of the dataset would give you bad results, but it may help you understand how much you can handle in RAM – Corentin Limier Jul 10 '19 at 11:34
  • You may try this : https://scikit-multiflow.github.io/scikit-multiflow/skmultiflow.classification.lazy.knn.html – Corentin Limier Jul 10 '19 at 11:35
  • @CorentinLimier okay, thanks a lot. So the only solution is to use the partial_fit method on the data? (Knowing that not all models support it) – RalphCh97 Jul 10 '19 at 11:38
  • I think the easiest solutions are listed on the first link I gave you. Handling big datasets with little memory is generally not an easy problem :) – Corentin Limier Jul 10 '19 at 11:43
  • @CorentinLimier too bad the TfidfVectorizer does not work well with the partial_fit method. There are no clear solutions to this problem :( (a sketch of the out-of-core route from that first link follows below) – RalphCh97 Jul 10 '19 at 13:56
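
A minimal sketch of the out-of-core approach from the scaling-strategies link above, assuming the same CSV, column names and fr_stopwords list as in the question: HashingVectorizer is stateless (there is no vocabulary to fit), so each chunk can be transformed independently and fed to SGDClassifier.partial_fit.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

datafile = "C:\\Users\\user\\Desktop\\CLEAN_ALL_DATA.csv"   # same CSV as in the question

# Stateless vectorizer: nothing to fit over the full corpus, so no huge vocabulary in memory
hv = HashingVectorizer(analyzer='word', ngram_range=(1, 1), stop_words=fr_stopwords)
clf = SGDClassifier(loss="log", n_jobs=-1)

# partial_fit needs the complete set of labels up front; read only that column
labels = pd.read_csv(datafile, usecols=['BENEFITITEMCODEID'], encoding='latin-1')
classes = np.unique(labels['BENEFITITEMCODEID'].str[1:])

for chunk in pd.read_csv(datafile, chunksize=100000, encoding='latin-1'):
    X = hv.transform(chunk['NEWSERVICEITEMNAME'].values.astype('U'))
    y = chunk['BENEFITITEMCODEID'].str[1:]
    clf.partial_fit(X, y, classes=classes)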

2 Answers


I don't know which algorithms you're using (or, more importantly, how they're implemented), but have you tried making your x & y inputs generators? That datatype can save a ton of space compared to, say, lists. Some links:

https://wiki.python.org/moin/Generators

Is there a way to avoid this memory error?

Additionally, I know there are several models that can be trained in parts (you feed in some data, save the model, then load the model and continue training it; I know Gensim is able to do this, for instance), which may help as well.
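
For instance, here is a rough sketch of both ideas using the question's setup (datafile, vectorizer and classes as defined in the question; the estimator must expose partial_fit, as SGDClassifier does). The generator yields one chunk at a time instead of loading everything, and joblib can save the partly-trained model and reload it later:

import joblib
import pandas as pd
from sklearn.linear_model import SGDClassifier

def batches(path, chunksize=100000):
    # Generator: yields one (texts, labels) batch at a time instead of the whole dataset
    for chunk in pd.read_csv(path, chunksize=chunksize, encoding='latin-1'):
        yield chunk['NEWSERVICEITEMNAME'].values.astype('U'), chunk['BENEFITITEMCODEID'].str[1:]

clf = SGDClassifier(loss="log")
for texts, labels in batches(datafile):
    clf.partial_fit(vectorizer.transform(texts), labels, classes=classes)

joblib.dump(clf, "model.joblib")    # save the partly-trained model...
clf = joblib.load("model.joblib")   # ...and load it later to keep training or to predict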

Evan Mata
  • Hello, I am currently working with SGDClassifier(loss="log", n_jobs=-1). According to https://stackoverflow.com/q/20952418/11154881, I don't think I could use generators. Also, yes, the SGDClassifier does support partial_fit, but whenever I try to predict a value I have to vectorize it and pass it into the predict function, which also leads to a memory error :( – RalphCh97 Jul 11 '19 at 10:03
  • For generators, I'd honestly say just try it and see if it works. When you say you have to "vectorize it", does that mean the training data? Or something you're trying to predict? Either way, what are the downsides of splitting it into smaller pieces? – Evan Mata Jul 11 '19 at 14:21
  • Yes, by "it" I was referring to the input that I'm trying to predict. The downside is that not every ML model supports this kind of incremental learning – RalphCh97 Jul 12 '19 at 08:59
  • Hi @RalphCh97, I am having the exact same problem with other algorithms: at predict time the memory peak is high (working through the model to predict), but the memory change is very low (just returning some prediction values). I used scikit-learn's config_context method to restrict memory somehow, but no luck. Can you clarify whether you were able to come to a workaround? – Mauricio Maroto Aug 30 '20 at 05:48
  • @MauricioMaroto I actually did fix the problem, and I answered my own Stack Overflow question above. I limited the number of max_features in the TfidfVectorizer by setting the value to 10k features. This should help with the memory problem – RalphCh97 Aug 31 '20 at 10:13

According to: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

TfidfVectorizer has a parameter called max_features that takes an int. This parameter lets us choose how many features we want to keep in our matrix, which gives us some control over the memory issue.

It's also worth mentioning that the max_df and min_df parameters help reduce the matrix size as well.
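
For example (max_features=10000 is the value that fixed it for me; the max_df / min_df thresholds below are just illustrative starting points to tune):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 1),
                             analyzer='word',
                             stop_words=fr_stopwords,
                             max_features=10000,   # keep only the 10,000 most frequent terms
                             max_df=0.95,          # drop terms appearing in more than 95% of documents
                             min_df=5)             # drop terms appearing in fewer than 5 documents

datavec = vectorizer.fit_transform(data.values.astype('U'))
print(datavec.shape)   # the second dimension is now capped at 10,000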

RalphCh97