Rewrite NLTK code to a function which can be used multiple times in Python

Question

How can I rewrite my code as a function that can be called repeatedly?

My code

import string

import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# `data` is an existing DataFrame with an 'Adj_Addr' column
stopwords = nltk.corpus.stopwords.words('english')
user_defined_stop_words = ['st', 'rd', 'kwun tong', 'kwai chung', 'kwun', 'tong']
new_stop_words = stopwords + user_defined_stop_words
# lower-case and normalise whitespace
data['Clean_addr'] = data['Adj_Addr'].apply(lambda x: ' '.join([item.lower() for item in x.split()]))
# drop digit characters
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: "".join([item.lower() for item in x if not item.isdigit()]))
# drop punctuation characters
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: "".join([item.lower() for item in x if item not in string.punctuation]))
# drop stop words
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: ' '.join([item.lower() for item in x.split() if item not in new_stop_words]))
cv = CountVectorizer(max_features=200, analyzer='word', ngram_range=(1, 3))
cv_addr = cv.fit_transform(data.pop('Clean_addr'))
# note: pd.SparseSeries exists only in pandas < 1.0
for i, col in enumerate(cv.get_feature_names()):
    data[col] = pd.SparseSeries(cv_addr[:, i].toarray().ravel(), fill_value=0)

Any help appreciated.

This is NOT a code rewriting/refactoring service. You'd better explain what you're trying to do, with some sample input data pasted as text in your question and the expected output. Read this link: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — cs95, Jan 01 '18 at 09:30

score -1 · Answer 1 · answered Jan 01 '18 at 09:52

Here is a function from my own code that you can use as a reference:

import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

wnlemma = nltk.WordNetLemmatizer()
addstopwords = ['the', 'is', 'it', 'may', 'was', '1', '2', '3', '4', '5', '6',
                '7', '8', '9', '0', 'employee', 'employer', 'approximately']
# the NLTK stop-word file is named "english" (lower-case)
newstopwords = stopwords.words("english") + addstopwords

# pre-process and join into string function
def pre_process_str(text):
    # tokenize
    tokens = word_tokenize(text)

    # lower-case first, then remove stopwords (the stop-word list is lower-case)
    tokens = [word.lower() for word in tokens]
    tokens = [word for word in tokens if word not in newstopwords]

    # wordnet lemmatization
    tokens = [wnlemma.lemmatize(t) for t in tokens]

    # remove punctuation
    tokens = [word for word in tokens if word not in string.punctuation]

    # remove words shorter than 3 letters
    tokens = [word for word in tokens if len(word) >= 3]

    # join as string
    text_after_process = " ".join(tokens)

    return text_after_process
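
The same idea can be applied to the cleaning steps in the question. The following is only a minimal sketch, not a drop-in replacement: the name `clean_address`, the `example_stop_words` set and the sample address are made up for illustration, and the function removes only ASCII digits and punctuation.

```python
import string

# Translation table that deletes ASCII digits and punctuation characters.
_DROP_TABLE = str.maketrans("", "", string.digits + string.punctuation)


def clean_address(text, stop_words):
    """Lower-case, strip digits and punctuation, then drop stop words."""
    text = text.lower().translate(_DROP_TABLE)
    return " ".join(word for word in text.split() if word not in stop_words)


# Hypothetical stop-word set; in the question this would be the NLTK list
# plus the user-defined words.
example_stop_words = {"st", "rd", "kwun", "tong", "the"}
cleaned = clean_address("Flat 12, 3rd Floor, Hoi Yuen Rd, Kwun Tong", example_stop_words)
print(cleaned)  # flat floor hoi yuen
```

With a DataFrame like the one in the question, this could probably be called as `data['Clean_addr'] = data['Adj_Addr'].apply(clean_address, stop_words=set(new_stop_words))`, and reused for any other text column. Because the text is split on whitespace, multi-word entries such as 'kwun tong' can never match a single token, so only the single-word stop words have any effect.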

This is really bad code: you're looping through the same tokens multiple times when the steps could have been combined into one or two passes... — alvas, Jan 01 '18 at 13:41
See https://stackoverflow.com/questions/48049087/applying-nltk-based-text-pre-proccessing-on-a-pandas-dataframe and https://stackoverflow.com/questions/47769818/why-is-my-nltk-function-slow-when-processing-the-dataframe and https://www.kaggle.com/alvations/basic-nlp-with-nltk — alvas, Jan 01 '18 at 13:42
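
As a rough illustration of the single-pass idea raised in the comment above, the filtering steps can be folded into one loop. This is a sketch under assumptions: `pre_process_tokens`, its `lemmatize` and `min_len` parameters, and the sample sentence are hypothetical names chosen here, and the tokens are produced with a plain whitespace split so the example runs without NLTK data.

```python
import string

PUNCTUATION = set(string.punctuation)


def pre_process_tokens(tokens, stop_words, lemmatize=lambda word: word, min_len=3):
    """Lower-case, filter and lemmatize tokens in a single pass."""
    kept = []
    for token in tokens:
        word = token.lower()
        # Skip stop words and single punctuation characters.
        if word in stop_words or word in PUNCTUATION:
            continue
        word = lemmatize(word)
        # Keep only words with at least `min_len` characters.
        if len(word) >= min_len:
            kept.append(word)
    return " ".join(kept)


# Whitespace split stands in for a real tokenizer in this example.
sample_tokens = "The employees were walking , slowly !".split()
result = pre_process_tokens(sample_tokens, stop_words={"the", "were"})
print(result)  # employees walking slowly
```

In the answer's setting, `word_tokenize(text)` could supply the tokens and `wnlemma.lemmatize` could be passed as `lemmatize`. Passing the stop words as a `set` rather than a list also makes each membership test cheaper.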
