
I want to remove stop words and punctuation from amazon_baby.csv.

import pandas as pd

data = pd.read_csv('amazon_baby.csv')
data.fillna(value='', inplace=True)
data.head()


import string
from nltk.corpus import stopwords

def text_process(msg):
    no_punc = [char for char in msg if char not in string.punctuation]
    no_punc = ''.join(no_punc)
    return [word for word in no_punc.split() if word.lower() not in stopwords.words('english')]

data['review'].apply(text_process)

This code executes fine on up to 10k rows, but when I apply it to the entire dataset the kernel stays busy and the cell never finishes executing.

Please help with this.

Find the data set here.
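One way to gauge whether the full run will ever finish is to time the function on a small slice of the DataFrame first and extrapolate. A minimal sketch of that idea, using a toy DataFrame and a simplified stand-in for text_process so it runs on its own (in the question this would be the real amazon_baby.csv data, ~183k rows):

```python
import time

import pandas as pd

# Toy stand-in for the real DataFrame loaded from amazon_baby.csv.
data = pd.DataFrame({'review': ['This is a great product', 'Not good at all'] * 5000})

def text_process(msg):
    # Simplified stand-in for the question's cleaning function.
    return [w for w in msg.split() if w.isalpha()]

start = time.time()
sample_result = data['review'].head(1000).apply(text_process)  # time 1000 rows only
elapsed = time.time() - start
print(f'{elapsed:.3f}s for 1000 rows; a full run would take roughly '
      f'{elapsed * len(data) / 1000:.1f}s at this rate')
```

If the extrapolated time is hours rather than minutes, the per-row work itself needs to be made cheaper before applying it to the whole dataset.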

xrisk
Venkat

2 Answers


You are processing the data character by character, which is extremely slow.

The slowdown comes from the sheer size of the data (~183,531 rows) combined with per-character processing of every row, which pushes the total work toward O(n²). I have implemented a slightly different approach using word_tokenize below:

import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stop-word set once, outside the function -- rebuilding it on
# every call is itself a major source of slowness.
stop_words = set(stopwords.words('english'))

def remove_punctuation_and_stopwords(msg):
    word_tokens = word_tokenize(msg)
    filtered_words = [w for w in word_tokens
                      if w.lower() not in stop_words and w not in string.punctuation]
    return filtered_words

I tried running it for 6 minutes and it processed 136,322 rows. I'm sure that with about 10 minutes it would have completed successfully.

Vivek
  • Hi Vivek, as you said, the code executed in about 10 minutes, but the output was a little different. I made a small change to it and now it's working. Thank you @Vivek – Venkat Jun 09 '18 at 18:43
  • Please don't do this, it's slow... See https://stackoverflow.com/questions/47769818/why-is-my-nltk-function-slow-when-processing-the-dataframe – alvas Jun 09 '18 at 23:58
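The linked question's point is that per-row calls to stopwords.words('english') and per-character loops are what make these functions slow. A minimal sketch of the faster pattern (the stop list is hard-coded here so the snippet runs without NLTK; in real code you would build it once with set(stopwords.words('english'))):

```python
import string

# Hard-coded sample stop words standing in for NLTK's list (an assumption
# for self-containment); build this set ONCE, outside the cleaning function.
STOP_WORDS = {'the', 'a', 'an', 'is', 'this', 'not', 'and', 'in'}

# str.translate strips all punctuation in a single C-level pass,
# instead of a Python loop over every character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def fast_clean(msg):
    no_punc = msg.translate(PUNCT_TABLE)
    return [w for w in no_punc.split() if w.lower() not in STOP_WORDS]

print(fast_clean('This is not a great product!'))  # ['great', 'product']
```

Because both the stop-word set and the translation table are built once, each row only pays for a set lookup per word rather than a linear scan of a list.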
import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def text_clean(msg):
    tokens = word_tokenize(msg)
    tokens = [w.lower() for w in tokens]
    stop_words = set(stopwords.words('english'))
    no_punc_and_stop_words = [w for w in tokens
                              if w not in string.punctuation and w not in stop_words]
    return no_punc_and_stop_words
Venkat
  • Please don't do this, it's also slow... See https://stackoverflow.com/questions/47769818/why-is-my-nltk-function-slow-when-processing-the-dataframe – alvas Jun 09 '18 at 23:59