-1

I want to keep a word only if it is neither in the stop words nor in string.punctuation, for each row of my column.

This is my data after tokenizing and removing the stop words. I also want to remove the punctuation at the same time as I remove the stop words. See row 2: after `usf` there is a comma. I thought of using `if word not in (stopwords,string.punctuation)`, since I assumed it would mean "not in stopwords and not in string.punctuation" (I saw it here), but it fails to remove both the stop words and the punctuation. How can I fix this?

data['text'].head(5)
Out[38]: 
0    ['ve, searching, right, words, thank, breather...
1    [free, entry, 2, wkly, comp, win, fa, cup, fin...
2    [nah, n't, think, goes, usf, ,, lives, around,...
3    [even, brother, like, speak, ., treat, like, a...
4                                 [date, sunday, !, !]
Name: text, dtype: object

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

data = pd.read_csv(r"D:/python projects/read_files/SMSSpamCollection.tsv",
                    sep='\t', header=None)

data.columns = ['label','text']

stopwords = set(stopwords.words('english'))

def process(df):
    data = word_tokenize(df.lower())
    data = [word for word in data if word not in (stopwords,string.punctuation)]
    return data

data['text'] = data['text'].apply(process)
random student
  • What do you expect the result will be when writing `(stopwords, string.punctuation)`? – Anwarvic May 21 '20 at 16:05
  • You've created a tuple containing two elements, and only those two elements. `stopwords in (stopwords, string.punctuation)` would return `True`, for example – G. Anderson May 21 '20 at 16:09
  • @Anwarvic I expected it to work like `data = [word for word in data if word not in stopwords and word not in string.punctuation]` – random student May 21 '20 at 16:12
  • @G.Anderson Yeah, I thought it would keep only the words that are not in `stopwords` and not in `string.punctuation` – random student May 21 '20 at 16:14
  • You should know that `string.punctuation` is a string of characters. `word in string.punctuation` won't return `True` unless `word` is a punctuation character... is that what you want? – Anwarvic May 21 '20 at 16:15
  • What I'm saying is that the only elements in that tuple are the variables/attributes/elements that you put in, `stopwords` and `string.punctuation`. There's nothing magical about a tuple that automatically expands the things you put inside it. Try printing the tuple and see what the output is – G. Anderson May 21 '20 at 16:18
  • I know that. But since I tokenized it with nltk, each punctuation character is separated from the word and becomes its own token in the list of words. I thought I could remove it that way, though I'm not really sure – random student May 21 '20 at 16:19

3 Answers

1

Then you need to change

data = [word for word in data if word not in (stopwords,string.punctuation)]

to

data = [word for word in data if word not in stopwords and word not in string.punctuation]
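To see why the tuple version filters nothing, here is a minimal sketch. The stop word set and the token list are made-up stand-ins (not the real NLTK list or the SMS data), so it runs without downloading anything:

```python
import string

# Hypothetical stand-ins: a tiny stop word set and an already tokenized message
stop_words = {"i", "the", "he", "around"}
tokens = ["nah", "i", "think", "he", "goes", "to", "usf", ",",
          "he", "lives", "around", "here"]

# The tuple holds exactly two objects: the whole set and the whole punctuation
# string. A token never equals either of them, so every token is kept.
unfiltered = [w for w in tokens if w not in (stop_words, string.punctuation)]

# Testing each collection separately filters as intended.
filtered = [w for w in tokens
            if w not in stop_words and w not in string.punctuation]

print(unfiltered == tokens)  # True: nothing was removed
print(filtered)              # ['nah', 'think', 'goes', 'to', 'usf', 'lives', 'here']
```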
SuperStew
1

If you still want to do it in a single membership test, you could convert string.punctuation to a set and combine it with stopwords using set union (the | operator). This is how it would look:

data = [word for word in data if word not in (stopwords|set(string.punctuation))]
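A possible refinement is to build the combined set once, outside the comprehension, so the union is not recomputed for every word. The sketch below assumes a small made-up stop word set; `excluded` and `drop_excluded` are hypothetical names. It also shows one difference from the `and` version: `in` on a string is a substring test, while `in` on a set is an exact match.

```python
import string

# Hypothetical stand-in for the NLTK stop word set
stop_words = {"i", "the", "he"}

# Build the union once; each character of string.punctuation becomes one element
excluded = stop_words | set(string.punctuation)

def drop_excluded(tokens):
    """Keep only tokens that are neither stop words nor single punctuation marks."""
    return [w for w in tokens if w not in excluded]

print(drop_excluded(["date", "sunday", "!", "!"]))  # ['date', 'sunday']

# "()" is a contiguous piece of string.punctuation, so the string test matches it,
# but the set only contains the single characters "(" and ")".
print("()" in string.punctuation)  # True
print("()" in excluded)            # False
```

Multi-character tokens such as `...` are in neither collection, so they would survive both versions.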
Prateek Dewan
1

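In the function `process`, you can convert both the stop words and `string.punctuation` to a `pandas.Series` and combine them with `pd.concat`. Membership has to be tested against `.values`, because `in` on a Series checks the index labels rather than the data.

The function would be:

def process(df):
    data = word_tokenize(df.lower())
    # Combine the stop words and the individual punctuation characters into one Series
    excluded = pd.concat([pd.Series(list(stopwords)),
                          pd.Series(list(string.punctuation))])
    # `in` on a Series tests the index, so compare against the underlying values
    data = [word for word in data if word not in excluded.values]
    return data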
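Two things are worth checking with the Series approach: `in` applied to a `pandas.Series` looks at the index labels, not the stored values, and passing a whole string to `pd.Series` produces a single element rather than one element per character. A small sketch with a made-up two-element Series `s`:

```python
import string
import pandas as pd

# Hypothetical Series holding two punctuation tokens; the index is 0 and 1
s = pd.Series([",", "."])

print("," in s)         # False: "," is not an index label
print(0 in s)           # True: 0 is an index label
print("," in s.values)  # True: membership against the underlying array
print("," in set(s))    # True: a set of the values also works

# A bare string is treated as one scalar value, not as a sequence of characters
whole = pd.Series(string.punctuation)
per_char = pd.Series(list(string.punctuation))
print(len(whole), len(per_char))  # 1 32
```

For a pure membership test, a plain Python set is usually the simpler choice; the Series is only needed if the data is already in pandas.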
Wahib Mzali