I need to pre-process some text documents so that I can apply classification techniques like FCM etc. and topic modeling techniques like latent Dirichlet allocation (LDA) etc.
To elaborate on the preprocessing: I need to remove the stop words, extract the nouns and keywords, and perform stemming. The code I used for this purpose is:
#--------------------------------------------------------------------------
# Extracting nouns
#--------------------------------------------------------------------------
import nltk

documents = []
for x in a:  # a is my list of raw document strings
    # Tag every token and keep only singular/plural common nouns.
    tagged = nltk.pos_tag(nltk.word_tokenize(x))
    temp = ''  # reset per document, otherwise nouns accumulate across documents
    for word, tag in tagged:
        if tag in ("NN", "NNS"):
            temp += word + ' '
    documents.append(temp)
print(documents)
#--------------------------------------------------------------------------
# Remove stop words and words that occur only once
#--------------------------------------------------------------------------
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]  # stoplist is my set of stop words
allTokens = sum(texts, [])
# A count of 0 is impossible for a token drawn from allTokens; the intended
# test for words occurring only once is count == 1.
tokensOnce = set(word for word in set(allTokens) if allTokens.count(word) == 1)
texts = [[word for word in text if word not in tokensOnce] for text in texts]
print(texts)
#--------------------------------------------------------------------------
# Stemming
#--------------------------------------------------------------------------
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
for text in texts:
    # Stem each token in place.
    for j in range(len(text)):
        text[j] = porter.stem(text[j])
print(texts)
The problems with the code above are:
- The NLTK tagging used for extracting nouns and keywords misses many words. For example, after pre-processing some documents, names like 'Sachin' were not recognized as keywords and disappeared from the output (see the tag check after this list).
- The words are not stemmed properly. There is too much stemming in some cases ('network' and 'networking' both reduced to 'net'), and sometimes nouns are stemmed as well (see the stemmer comparison after this list).
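To illustrate the first point, here is a minimal check (with a made-up sentence) of the tag a name actually receives; proper nouns come back as "NNP"/"NNPS", which the NN/NNS filter above throws away, though the exact tag can vary with the tagger model:

import nltk

# Names are usually tagged "NNP" (proper noun), which my NN/NNS filter drops.
tokens = nltk.word_tokenize("Sachin scored a century in the final match")
print(nltk.pos_tag(tokens))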
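To illustrate the second point, this quick comparison prints what two stock NLTK stemmers return for the words in question (LancasterStemmer is just an alternative stemmer I tried for comparison, not part of my pipeline):

from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer

# Compare how aggressively the two stemmers reduce related word forms.
porter = PorterStemmer()
lancaster = LancasterStemmer()
for word in ["network", "networking", "networks"]:
    print(word, porter.stem(word), lancaster.stem(word))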
Is there a better module for these tasks, or a better way of using the same modules? Kindly help.