
I have a dataset of 1 million records, as below.

Sample DF1:

  articles_urlToImage   feed_status   status   keyword
  hhtps://rqqkf.com     untagged      tag      the apple,a mobile phone
  hhtps://hqkf.com      tagged        ingore   blackberry, the a phone
  hhtps://hqkf.com      untagged      tag      amazon, an shopping site

Now I want to remove the NLTK stopwords as well as some custom stopwords, as below.

Custom stop words = ['phone', 'site'] (I have around 35 custom stop words.)

Expected output:

  articles_urlToImage   feed_status   status   keyword
  hhtps://rqqkf.com     untagged      tag      apple,mobile
  hhtps://hqkf.com      tagged        ingore   blackberry
  hhtps://hqkf.com      untagged      tag      amazon,shopping

I have tried to remove the stopwords, but I am getting the error below.

Code:

import nltk
import string
from nltk.corpus import stopwords
stop = stopwords.words('english') 

df1['keyword'] = df1['keyword'].apply(lambda x: [item for item in x if item not in stop])

Error:

  /usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in __getattr__(self, name)
   3612             if name in self._info_axis:
   3613                 return self[name]
-> 3614             return object.__getattribute__(self, name)
   3615 
   3616     def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'split'
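(Note: the traceback refers to `.split`, which the code above does not call, so it presumably came from a variant of the attempt that called `split` directly on the column. A pandas Series has no `split` method; per-element string operations go through the `.str` accessor or a per-row `apply`. A minimal sketch of the difference, using a one-row `df1` rebuilt from the sample data:)

    import pandas as pd

    df1 = pd.DataFrame({'keyword': ['the apple,a mobile phone']})

    # df1['keyword'].split(',')   # AttributeError: 'Series' object has no attribute 'split'
    print(df1['keyword'].str.split(','))                 # element-wise split via the .str accessor
    print(df1['keyword'].apply(lambda x: x.split(',')))  # equivalent per-row split with apply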
Rahul Varma
  • Does https://stackoverflow.com/questions/48049087/nltk-based-text-processing-with-pandas/48049425#48049425 or https://stackoverflow.com/questions/51914481/stopword-removal-with-pandas/51914517#51914517 help? – cs95 Dec 18 '18 at 06:30
  • getting this error `LookupError: ********************************************************************** Resource stopwords not found. Please use the NLTK Downloader to obtain the resource: >>> import nltk >>> nltk.download('stopwords') Searched in: - '/root/nltk_data' - '/usr/share/nltk_data' - '/usr/local/share/nltk_data' - '/usr/lib/nltk_data' - '/usr/local/lib/nltk_data' - '/usr/nltk_data' - '/usr/lib/nltk_data' **********************************************************************` – Rahul Varma Dec 18 '18 at 06:33
  • Google search is just one click away, no? – cs95 Dec 18 '18 at 06:40

1 Answer


You can use:

from nltk.corpus import stopwords

stop = stopwords.words('english')
custom = ['phone', 'site']
# join the custom stop words with the NLTK English stop words
stop = custom + stop

# strip punctuation, split on whitespace and drop the stop words
df1['keyword'] = (df1['keyword'].str.replace(r'[^\w\s]+', ' ', regex=True)
                     .apply(lambda x: [item for item in x.split() if item not in stop]))
print(df1)
  articles_urlToImage feed_status  status             keyword
0   hhtps://rqqkf.com    untagged     tag     [apple, mobile]
1    hhtps://hqkf.com      tagged  ingore        [blackberry]
2    hhtps://hqkf.com    untagged     tag  [amazon, shopping]
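
One possible follow-up, since the expected output in the question shows plain comma-separated strings rather than Python lists: the lists can be joined back into strings afterwards. A minimal sketch under that assumption:

    # turn each list of remaining keywords back into a comma-separated string,
    # e.g. ['apple', 'mobile'] -> 'apple,mobile'
    df1['keyword'] = df1['keyword'].str.join(',')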
jezrael
  • I am getting error while running `stop = stopwords.words('english') ` error `--------------------------------------------------------------------------- LookupError Traceback (most recent call last) /usr/local/lib/python3.6/dist-packages/nltk/corpus/util.py in __load(self) 79 except LookupError as e: ---> 80 try: root = nltk.data.find('{}/{}'.format(self.subdir, zip_name)) 81 except LookupError: raise e` – Rahul Varma Dec 18 '18 at 06:41
  • @RahulVarma - hmmm, that means a problem with `nltk` itself. Did you install the nltk stopwords with the `NLTK Downloader`? – jezrael Dec 18 '18 at 06:43
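
For reference, the `LookupError` quoted in the comments is what NLTK raises when the stopwords corpus has not been downloaded yet; it can be fetched once with the NLTK downloader, exactly as the error message itself suggests:

    import nltk

    # download the stopwords corpus once; afterwards stopwords.words('english') should load
    nltk.download('stopwords')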