
(1) My goal: I am trying to use an SVM to classify 10,000 documents (each with about 400 words) into 10 evenly distributed classes. The features explored in my work include word n-grams (n = 1 to 4) and character n-grams (n = 1 to 6).

(2) My approach: I represent each document as a vector of frequency values, one per feature, and then apply TF-IDF to normalize the vectors. Parts of my code are below:

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import svm
import ngram  # my own helper module providing characterNgrams() and wordNgrams()

def commonVec(dicts, count1, count2):
    '''Put features whose document frequency lies between count1 and count2
    into a common vector used for SVM training.'''
    global_vector = []
    master = {}
    for i, d in enumerate(dicts):
        for k in d:
            master.setdefault(k, []).append(i)
    for key in master:
        if (len(master[key])>=count1 and len(master[key])<=count2):  
            global_vector.append(key)
    global_vector1 = sorted(global_vector)
    return global_vector1 
def featureComb(mix, count1, count2, res1):
    '''Combine word n-gram and character n-gram features into one vector.'''
    if mix[0]:
        common_vector1 = []
        for i in mix[0]:
            dicts1 = []
            for res in res1:  # I store all documents in a database; res1 is the query result set and res is one document row.
                dicts1.append(ngram.characterNgrams(res[1], i))  # characterNgrams() returns a dictionary mapping feature names to frequencies.
            common_vector1.extend(commonVec(dicts1, count1, count2))
    else:
        common_vector1 = []
    if mix[1]:
        common_vector2 = []
        for j in mix[1]:
            dicts2 = []
            for res in res1:
                dicts2.append(ngram.wordNgrams(res[1], j))        
            common_vector2.extend(commonVec(dicts2, count1, count2))
    else:
        common_vector2 = []
    return common_vector1+common_vector2

def svmCombineVector(mix, global_combine, label, X, y, res1):
    '''Construct the X matrix and y labels used to train the SVM.'''
    lstm = []
    for res in res1:            
        y.append(label[res[0]]) # insert class label into y

        dici1 = {}
        dici2 = {}
        freq_term_vector = []
        for i in mix[0]:
            dici1.update(ngram.characterNgrams(res[1], i))
        # frequency of each global feature in this document's character n-grams (0 if absent)
        freq_term_vector.extend(dici1.get(gram, 0) for gram in global_combine)
        for j in mix[1]:
            dici2.update(ngram.wordNgrams(res[1], j))
        # and the same lookup over the word n-grams
        freq_term_vector.extend(dici2.get(gram, 0) for gram in global_combine)
        lstm.append(freq_term_vector)
    freq_term_matrix = np.matrix(lstm)
    transformer = TfidfTransformer(norm="l2")
    tfidf = transformer.fit_transform(freq_term_matrix)
    X.extend(tfidf.toarray())

# res1 (document rows from the database) and label (document id -> class) are built elsewhere.
X = []
y = []
character = [1, 2, 3, 4, 5, 6]  # character n-gram orders
word = [1, 2, 3, 4]             # word n-gram orders
mix = [character, word]
global_vector_combine = featureComb(mix, 2, 5000, res1)
print len(global_vector_combine)  # 542401
svmCombineVector(mix, global_vector_combine, label, X, y, res1)
clf1 = svm.LinearSVC()
clf1.fit(X, y)

(3) My problem: However, when I run the code, a `MemoryError` occurs:

Traceback (most recent call last):
      File "svm.py", line 110, in <module>
        functions.svmCombineVector(mix,global_vector_combine,label,X,y,res1)
      File "/home/work/functions.py", line 201, in svmCombineVector
        X.extend(tfidf.toarray())
      File "/home/anaconda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 901, in toarray
        return self.tocoo(copy=False).toarray(order=order, out=out)
      File "/home/anaconda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 269, in toarray
        B = self._process_toarray_args(order, out)
      File "/home/anaconda/lib/python2.7/site-packages/scipy/sparse/base.py", line 789, in _process_toarray
    _args
        return np.zeros(self.shape, dtype=self.dtype, order=order)
    MemoryError

I am really struggling with this and need help from Stack Overflow.

  1. Could anyone explain what is going wrong and give me some idea of how to solve it?
  2. Could anyone review my source code and suggest other ways to use memory more effectively?
allenwang
  • What is the matrix size? – Kobi K Dec 29 '14 at 13:46
  • The error message doesn't match the code. `freq_term_matrix = np.matrix(lstm)` vs. `freq_term_matrix = np.matrix(matrix)`. It's also unclear what `matrix` is. `MemoryError` suggests that you're trying to create a matrix that is too large -- which is very possible when doing NLP. – senderle Dec 29 '14 at 13:48
  • [shai geva](http://stackoverflow.com/users/3862437/shai-geva) asks: Can you post more of your code? What is "lstm"? Can you show a portion of "common" and "dici"? What is the matrix in matrix.append(freq_term_vector)? Btw, this looks like a RAM allocation problem. The answer would probably be "don't use the list, create the matrix directly", but let's see more details to be sure. – senderle Dec 29 '14 at 13:54
  • @KobiK I did not set the size of the matrix explicitly. The size of the vector used for SVM training is 542401. – allenwang Dec 29 '14 at 14:28
  • @senderle Sorry for my typo; I have edited the question. The source code is a little large, so I added some comments to it. Please help me solve it. – allenwang Dec 29 '14 at 14:32
  • @senderle After editing, my question should be a little clearer now. I want to convert a list of lists into a matrix that can be used in `TfidfTransformer`. The dimension of each list is indeed 542401, but I really don't know why such an error occurred. Please give me some idea. – allenwang Dec 29 '14 at 14:41
  • OK, so I saw that the dimension of each list was 542401, but that's not enough information to know the size of the `matrix`. I see from your comments that the matrix will probably be `(n_documents, n_features)` where `n_features = 542401`. What is `n_documents`? – senderle Dec 29 '14 at 15:15
  • @senderle I am doing a text classification task, and `n_documents = 10000`. When I run my code on the server, it uses over 200 GB of memory and hits `MemoryError`, but I don't know how to optimize my code. Is there any difference in memory use between a `list of lists` and a `matrix`? I really need your help! – allenwang Dec 30 '14 at 04:33
  • Well, first, do the math. How much memory will a (542401, 10000) array take up at a minimum? (1 byte per datapoint) How much memory if you store floats? (8 bytes per datapoint.) Take a moment to do the calculations. Then find out how much memory is available to you. You are probably going to find that you're trying to do the impossible! In that case, you're going to have to rethink this from the ground up. – senderle Dec 30 '14 at 10:24
  • @senderle Hi, thanks for your reply, but how did you work out that it is impossible? Is it 542401 * 10000 * 8 bytes? I do not know how to calculate it exactly. Could you please show me some details? – allenwang Dec 30 '14 at 11:32
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/67913/discussion-between-allenwang-and-senderle). – allenwang Dec 30 '14 at 12:21
  • @senderle As I am a newbie in Python programming, there are probably many inefficient lines in my code. Anyway, I hope you can help me. – allenwang Dec 31 '14 at 05:06

1 Answer


The main problem you're facing is that you're using far too many features. It's actually quite extraordinary that you've managed to generate 542401 features from documents that contain just 400 words! I've seen SVM classifiers separate spam from non-spam with high accuracy using just 150 features -- word counts of selected words that say a lot about whether the document is spam. These use stemming and other normalization tricks to make the features more effective.

You need to spend some time thinning out your features. Think about which features are most likely to contain information useful for this task. Experiment with different features. As long as you keep throwing everything but the kitchen sink in, you'll get memory errors. Right now you're trying to pass 10000 data points with 542401 dimensions each to your SVM. That's 542401 * 10000 * 4 = 21 gigabytes (conservatively) of data. My computer only has 4 gigabytes of RAM. You've got to pare this way down.[1]
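
For concreteness, here's that back-of-the-envelope calculation as a quick Python sketch, using the dimensions from your own code (4 bytes per value assumes float32; float64 doubles it):

n_documents = 10000
n_features = 542401

for dtype, bytes_per_value in [("float32", 4), ("float64", 8)]:
    total_bytes = n_documents * n_features * bytes_per_value
    print "%s: %.1f GB" % (dtype, total_bytes / 1e9)

# float32: 21.7 GB
# float64: 43.4 GB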

A first step towards doing so would be to think about how big your total vocabulary size is. Each document has only 400 words, but let's say those 400 words are taken from a vocabulary of 5000 words. That means there will be 5000 ** 4 = 6.25 * 10 ** 14 possible 4-grams. That's half a quadrillion possible 4-grams. Of course not all those 4-grams will appear in your documents, but this goes a long way towards explaining why you're running out of memory. Do you really need these 4-grams? Could you get away with 2-grams only? There are a measly 5000 ** 2 = 25 million possible 2-grams. That will fit much more easily in memory, even if all possible 2-grams appear (unlikely).
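
If you want to see how quickly the feature count actually grows with n on real data, a small sketch like the following counts the distinct word n-grams that really occur (the toy `docs` list and the `count_distinct_ngrams` helper are just illustrations; substitute your own tokenized documents):

from collections import Counter

def count_distinct_ngrams(tokenized_docs, n):
    '''Count distinct word n-grams that actually occur in a corpus,
    where tokenized_docs is a list of token lists, one per document.'''
    counts = Counter()
    for tokens in tokenized_docs:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"]]
for n in (1, 2, 3, 4):
    print "%d-grams: %d distinct" % (n, len(count_distinct_ngrams(docs, n)))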

Also keep in mind that even if the SVM could handle quadrillions of features, it would probably give bad results, because when you give any learning algorithm too many features, it will tend to overfit, picking up on irrelevant patterns and overgeneralizing from them. There are ways of dealing with this, but it's best not to deal with it at all if you can help it.

I will also mention that these are not "newbie" problems. These are problems that machine learning specialists with PhDs have to deal with. They come up with lots of clever solutions, but we're not so clever that way, so we have to be clever a different way.

Although I can't offer you specific suggestions for cleverness without knowing more, I would say that, first, stemming is a good idea in at least some cases. Stemming simply removes grammatical inflection, so that different forms of the same word ("swim" and "swimming") are treated as identical. This will probably reduce your vocabulary size significantly, at least if you're dealing with English text. A common choice is the porter stemmer, which is included in nltk, as well as in a number of other packages. Also, if you aren't already, you should probably strip punctuation and reduce all words to lower-case. From there, it really depends. Stylometry (identifying authors) sometimes requires only particles ("a", "an", "the"), conjunctions ("and", "but") and other very common words; spam, on the other hand, has its own oddball vocabularies of interest. At this level, it is very difficult to say in advance what will work; you'll almost certainly need to try different approaches to see which is most effective. As always, testing is crucial!
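
As a minimal sketch of that kind of preprocessing -- assuming English text held as Python 2 byte strings and that NLTK is installed -- lower-casing, punctuation stripping, and Porter stemming might look like this:

import string
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def normalize(text):
    '''Lower-case, strip punctuation, and stem each token (Python 2 str API).'''
    text = text.lower().translate(None, string.punctuation)
    return [stemmer.stem(token) for token in text.split()]

print normalize("Swim, swims, swimming.")
# all three forms reduce to the same stem: ['swim', 'swim', 'swim']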

[1] Well, possibly you have a huge amount of RAM at your disposal. For example, I have access to a machine with 48 GB of RAM at my current workplace. But I doubt it could handle this either, because the SVM will have its own internal representation of the data, which means there will be at least one copy at some point; if a second copy is needed at any point -- kaboom.

senderle
  • Thanks for your reply. You are right, we have to be clever in a different way. Could you please show me examples of how to thin out features, or other normalization tricks to make the features more effective? – allenwang Dec 31 '14 at 23:47
  • @allenwang, well, that's going to depend on what you're trying to do. It would really help if you would say more about the categories you're trying to place the texts in! Stemming will help in some cases, but hurt in others. The best suggestion I have, given what you've told me so far, is to avoid 4-grams and 3-grams. Just use 2-grams and individual words. – senderle Jan 01 '15 at 01:45
  • @allenwang, what's happening here is connected to the [no free lunch](http://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization) theorems. There's no fully general approach to machine learning problems that is fast for everything; approaches have to be tailored to specific problems. – senderle Jan 01 '15 at 16:16
  • Thanks for your kind answer and comments. Yeah, it is more effective to use only word unigrams and bigrams. Besides, do you know of any feature selection methods? I mean, selecting the most informative tens of thousands out of hundreds of thousands of features, and thereby reducing the vector dimension? – allenwang Jan 02 '15 at 04:51
  • Thanks for your patient instructions. After two days of experiments, I obtained a more informative feature combination. However, memory is still the bottleneck. As I am not a Python specialist, could you please help me review my code and see if there are lines where I can improve memory utilization? Thanks in advance! – allenwang Jan 05 '15 at 12:40
  • To be honest, it's not obvious to me how you would reduce the memory usage of the python code itself. There's a lower bound on memory optimization here. The best advice I can give you is very generic here. Use lists instead of dictionaries when possible. Avoid global variables, especially for temporary objects. Write functions that do just one thing, so that temporary objects they create can be garbage-collected. Tell me: what is the length of your feature vectors now? – senderle Jan 05 '15 at 20:35
  • Thanks for your reply. Now the length of my feature vector is 346693. – allenwang Jan 06 '15 at 01:03
  • @allenwang -- Ok, so you've cut the size by 2/3, but that's still going to be 346693 * 10000 * 4 (or 8) bytes = 13 (or 26) GB. Nothing anyone could possibly tell you about Python can help that lower bound. No knowledge of _any_ programming language can help you with this problem. It looks to me like you're still trying to do the impossible. Can you not solve the problem with unigrams alone? A _lot_ of problems can be solved that way! – senderle Jan 06 '15 at 02:08
  • If you really can't make your feature vectors any smaller, you'll need to look into online learning algorithms (i.e. algorithms that don't require you to load all the examples at once). SVMs aren't very good at that; see [here](http://stats.stackexchange.com/questions/26041/can-svm-do-stream-learning-one-example-at-a-time) if you still want to pursue that angle. Otherwise, I suggest using neural networks with stochastic gradient descent instead. If none of what I just said made any sense, just take Andrew Ng's [fantastic machine learning course](https://www.coursera.org/course/ml). – senderle Jan 06 '15 at 02:10
  • Thank you very much again; I really appreciate your helpful instructions. Although I have to deal with the memory problem, the ultimate target is to obtain accuracy as high as possible. Besides linear SVMs, I do not know of any other machine learning model that can achieve this goal. Are you sure neural networks with stochastic gradient descent can get high accuracy? – allenwang Jan 06 '15 at 04:53
  • Sorry, I've been busy with other work. Yes, neural networks definitely can get high accuracy in many cases, though it depends on the problem domain. In fact, under certain conditions, neural networks and SVMs are equivalent. But I think in general, when you need to train a learning algorithm on a very large amount of data, neural networks are better than SVMs. The reason people use SVMs instead of neural networks in practice is that they seem to require less tuning. – senderle Jan 10 '15 at 21:57
  • Hi, I have been very busy with other work these past two months. Now I want to continue my text classification task using neural networks, but after surveying some Python libraries I am struggling to choose between PyBrain and Neurolab. Could you please give me some advice? I really need your help. – allenwang Mar 20 '15 at 11:43
  • I'm afraid I can't help you any more if you don't tell me your specific task. And at this stage, we should move towards a new question -- this is no longer directly related to your original question. You should ask a new one, with _lots_ of detail, saying (1) what kind of texts you're classifying, (2) what kind of classes you want, and (3) what you've tried so far -- that is, show some actual code. – senderle Mar 20 '15 at 15:40