
I want to remove stop words and punctuation from amazon_baby.csv.

import pandas as pd

data = pd.read_csv('amazon_baby.csv')
data.fillna(value='', inplace=True)
data.head()


import string
from nltk.corpus import stopwords

def text_process(msg):
    no_punc = [char for char in msg if char not in string.punctuation]
    no_punc = ''.join(no_punc)
    return [word for word in no_punc.split() if word.lower() not in stopwords.words('english')]

data['review'].apply(text_process)

This code executes fine on up to 10k rows, but when I apply it to the entire dataset the kernel stays busy and the cell never finishes executing.

Please help with this.

Find the data set here.
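One way to gauge whether the full run will ever finish is to time the function on a small slice of the DataFrame first and extrapolate. A minimal sketch of that idea, using a toy DataFrame and a simplified stand-in for text_process so it runs on its own (in the question this would be the real amazon_baby.csv data, ~183k rows):

```python
import time

import pandas as pd

# Toy stand-in for the real DataFrame loaded from amazon_baby.csv.
data = pd.DataFrame({'review': ['This is a great product', 'Not good at all'] * 5000})

def text_process(msg):
    # Simplified stand-in for the question's cleaning function.
    return [w for w in msg.split() if w.isalpha()]

start = time.time()
sample_result = data['review'].head(1000).apply(text_process)  # time 1000 rows only
elapsed = time.time() - start
print(f'{elapsed:.3f}s for 1000 rows; a full run would take roughly '
      f'{elapsed * len(data) / 1000:.1f}s at this rate')
```

If the extrapolated time is hours rather than minutes, the per-row work itself needs to be made cheaper before applying it to the whole dataset.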

xrisk
Venkat

2 Answers


You are processing the data character by character, which is extremely slow.

The slowdown comes from the sheer size of the data (~183,531 rows) combined with per-character processing of every row, which pushes the total work toward O(n²). I have implemented a slightly different approach using word_tokenize below:

import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stop-word set once, outside the function -- rebuilding it on
# every call is itself a major source of slowness.
stop_words = set(stopwords.words('english'))

def remove_punctuation_and_stopwords(msg):
    word_tokens = word_tokenize(msg)
    filtered_words = [w for w in word_tokens
                      if w.lower() not in stop_words and w not in string.punctuation]
    return filtered_words

I tried running it for 6 minutes and it processed 136,322 rows. I'm sure that with about 10 minutes it would have completed successfully.

Vivek
  • Hi Vivek, as you said, the code executed in about 10 minutes, but the output was a little different. I made a small change to it and now it's working. Thank you @Vivek – Venkat Jun 09 '18 at 18:43
  • Please don't do this, it's slow... See https://stackoverflow.com/questions/47769818/why-is-my-nltk-function-slow-when-processing-the-dataframe – alvas Jun 09 '18 at 23:58
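The linked question's point is that per-row calls to stopwords.words('english') and per-character loops are what make these functions slow. A minimal sketch of the faster pattern (the stop list is hard-coded here so the snippet runs without NLTK; in real code you would build it once with set(stopwords.words('english'))):

```python
import string

# Hard-coded sample stop words standing in for NLTK's list (an assumption
# for self-containment); build this set ONCE, outside the cleaning function.
STOP_WORDS = {'the', 'a', 'an', 'is', 'this', 'not', 'and', 'in'}

# str.translate strips all punctuation in a single C-level pass,
# instead of a Python loop over every character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def fast_clean(msg):
    no_punc = msg.translate(PUNCT_TABLE)
    return [w for w in no_punc.split() if w.lower() not in STOP_WORDS]

print(fast_clean('This is not a great product!'))  # ['great', 'product']
```

Because both the stop-word set and the translation table are built once, each row only pays for a set lookup per word rather than a linear scan of a list.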
import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def text_clean(msg):
    tokens = word_tokenize(msg)
    tokens = [w.lower() for w in tokens]
    stop_words = set(stopwords.words('english'))
    no_punc_and_stop_words = [w for w in tokens
                              if w not in string.punctuation and w not in stop_words]
    return no_punc_and_stop_words
Venkat
  • Please don't do this, it's also slow... See https://stackoverflow.com/questions/47769818/why-is-my-nltk-function-slow-when-processing-the-dataframe – alvas Jun 09 '18 at 23:59