I need to pre-process some text documents so that I can apply classification techniques like FCM etc. and topic modeling techniques like latent Dirichlet allocation (LDA) etc.
To elaborate on the preprocessing: I need to remove the stop words, extract the nouns and keywords, and perform stemming. The code I used for this purpose is:
#--------------------------------------------------------------------------
# Extracting nouns
#--------------------------------------------------------------------------
import nltk

documents = []
for x in a:  # a is my list of raw document strings
    # Tag every token and keep only singular/plural common nouns.
    tagged = nltk.pos_tag(nltk.word_tokenize(x))
    temp = ''  # reset per document, otherwise nouns accumulate across documents
    for word, tag in tagged:
        if tag in ("NN", "NNS"):
            temp += word + ' '
    documents.append(temp)
print(documents)
#--------------------------------------------------------------------------
# Remove stop words and words that occur only once
#--------------------------------------------------------------------------
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]  # stoplist is my set of stop words
allTokens = sum(texts, [])
# A count of 0 is impossible for a token drawn from allTokens; the intended
# test for words occurring only once is count == 1.
tokensOnce = set(word for word in set(allTokens) if allTokens.count(word) == 1)
texts = [[word for word in text if word not in tokensOnce] for text in texts]
print(texts)
#--------------------------------------------------------------------------
# Stemming
#--------------------------------------------------------------------------
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
for text in texts:
    # Stem each token in place.
    for j in range(len(text)):
        text[j] = porter.stem(text[j])
print(texts)
The problems with the code above are:
- The NLTK tagging used for extracting nouns and keywords misses many words. For example, after pre-processing some documents, names like 'Sachin' were not recognized as keywords and disappeared from the output (see the tag check after this list).
- The words are not stemmed properly. There is too much stemming in some cases ('network' and 'networking' both reduced to 'net'), and sometimes nouns are stemmed as well (see the stemmer comparison after this list).
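To illustrate the first point, here is a minimal check (with a made-up sentence) of the tag a name actually receives; proper nouns come back as "NNP"/"NNPS", which the NN/NNS filter above throws away, though the exact tag can vary with the tagger model:

import nltk

# Names are usually tagged "NNP" (proper noun), which my NN/NNS filter drops.
tokens = nltk.word_tokenize("Sachin scored a century in the final match")
print(nltk.pos_tag(tokens))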
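To illustrate the second point, this quick comparison prints what two stock NLTK stemmers return for the words in question (LancasterStemmer is just an alternative stemmer I tried for comparison, not part of my pipeline):

from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer

# Compare how aggressively the two stemmers reduce related word forms.
porter = PorterStemmer()
lancaster = LancasterStemmer()
for word in ["network", "networking", "networks"]:
    print(word, porter.stem(word), lancaster.stem(word))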
Is there a better module for these tasks, or a better way of using the same modules? Kindly help.