Here is a function that I apply to my DataFrame. I have a CSV file named '100-contacts' on my computer; it contains information about mails, such as first name, address, city, etc. My goal is to detect spam mails. I need to clean the data by removing stopwords and punctuation. This piece of code should have helped, but I get a KeyError even though the key exists.

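import string
from nltk.corpus import stopwords

def process_text(text):
  # 1. Remove punctuation
  # 2. Remove stopwords
  # 3. Return a list of clean text words

  # 1. Drop punctuation characters, then rejoin with '' (not ' ') so words stay intact
  nopunc = [char for char in text if char not in string.punctuation]
  nopunc = ''.join(nopunc)

  # 2. Keep only the words that are not English stopwords
  clean_words = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

  # 3. Return the cleaned words
  return clean_words

df['text'].head().apply(process_text)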

1 Answer

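You might have spaces in your column names. If the header row is written as "name, age, text", pandas names the last column ' text' (with a leading space), so df['text'] raises a KeyError even though the column looks like it exists. Passing sep=r'\s*,\s*' when reading the CSV into the DataFrame treats the whitespace around each comma as part of the separator and might help. Because this separator is a regular expression, pandas parses the file with its Python engine.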

import pandas as pd
import string
from nltk.corpus import stopwords

# csv.csv
# name, age, text
# aa, 11, randomtext
# bb, 22, randomtexttext
# cc, 33, ra..ndo..mtexttext
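# A regex separator needs the Python engine; naming it explicitly avoids a ParserWarning
df = pd.read_csv('csv.csv', header=0, sep=r'\s*,\s*', engine='python')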

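def process_text(text):
  # 1. Remove punctuation
  # 2. Remove stopwords
  # 3. Return a list of clean text words

  # 1. Drop punctuation characters, then rejoin with '' (not ' ') so words stay intact
  nopunc = [char for char in text if char not in string.punctuation]
  nopunc = ''.join(nopunc)

  # 2. Keep only the words that are not English stopwords
  clean_words = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

  # 3. Return the cleaned words
  return clean_words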

print(df['text'].head().apply(process_text))
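
To check whether stray spaces really are the cause, it can help to print the column names exactly as pandas read them. The following is a minimal sketch, assuming a header written with a space after each comma; the sample data and the names raw, columns_before and columns_after are made up for illustration and stand in for the real '100-contacts' file.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real CSV; note the space after each comma.
raw = "name, age, text\naa, 11, randomtext\nbb, 22, randomtexttext\n"

# Read with the default separator, so the header names keep their leading spaces.
df = pd.read_csv(io.StringIO(raw))
columns_before = df.columns.tolist()
print(columns_before)  # ['name', ' age', ' text']

# Strip surrounding whitespace from every column name.
df.columns = df.columns.str.strip()
columns_after = df.columns.tolist()
print(columns_after)  # ['name', 'age', 'text']

# The lookup that would have raised KeyError now succeeds.
print(df['text'].head())
```

If the printed list shows names such as ' text', stripping the column names this way is an alternative to changing the separator. Another option that may fit is read_csv's skipinitialspace=True, which skips spaces that follow the delimiter; it does not remove trailing spaces.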