I'm trying to remove punctuation from a list of words. I'm new to Python programming, so any help would be great. The code is meant for email spam classification. Previously I joined the characters back together after checking each one for punctuation, but that gave me single characters rather than whole words. I have since changed the code to produce whole words (below), but the approach I used before to remove the punctuation no longer works.

import os
import string
from collections import Counter
from os import listdir  # return all files and folders in the directory

import nltk
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import pandas as pd
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# used for importing the lingspam dataset
def importLingspamDataset(dir):
    allEmails = [] # for storing the emails once read
    fileNames = []
    for file in listdir(dir):
        f = open((dir + '/' + file), "r")  # used for opening the file in read only format
        fileNames.append(file)
        allEmails.append(f.read()) # appends the read emails to the emails array
        f.close()
    return allEmails, fileNames

def importEnronDataset(dir):
    allEmails = []  # for storing the emails once read
    fileNames = []
    for file in listdir(dir):
        f = open((dir + '/' + file), "r")  # used for opening the file in read only format
        fileNames.append(file)
        allEmails.append(f.read())  # appends the read emails to the emails array
        f.close()
    return allEmails, fileNames

# used to remove punctuation from the emails as this is of no use for detecting spam
def removePunctuation(cleanedEmails):
    punc = set(string.punctuation)
    for word, line in enumerate(cleanedEmails):
        words = line.split()
        x = [''.join(c for c in words if c not in string.punctuation)]
        allWords = []
        allWords += x
        return allWords

# used to remove stopwords i.e. words of no use in detecting spam
def removeStopwords(cleanedEmails):
    removeWords = set(stopwords.words('english')) # sets all the stopwords to be removed
    for stopw in removeWords: # for each word in remove words
        if stopw not in removeWords: # if the word is not in the stopwords to be removed
            cleanedEmails.append(stopw) # add this word to the cleaned emails
    return(cleanedEmails)

# function to return words to their root form - allows simplicity
def lemmatizeEmails(cleanedEmails):
    lemma = WordNetLemmatizer() # to be used for returning each word to its root form
    lemmaEmails = [lemma.lemmatize(i) for i in cleanedEmails] # lemmatize each word in the cleaned emails
    return lemmaEmails

# function to allow a systematic process of eliminating the undesired elements within the emails
def cleanAllEmails(cleanedEmails):
    cleanPunc = removePunctuation(cleanedEmails)
    cleanStop = removeStopwords(cleanPunc)
    cleanLemma = lemmatizeEmails(cleanStop)
    return cleanLemma

def createDictionary(email):
    allWords = []
    allWords.extend(email)
    dictionary = Counter(allWords)
    dictionary.most_common(3000)
    word_cloud = WordCloud(width=400, height=400, background_color='white',
              min_font_size=12).generate_from_frequencies(dictionary)
    plt.imshow(word_cloud)
    plt.axis("off")
    plt.margins(x=0, y=0)
    plt.show()
    word_cloud.to_file('test1.png')

def featureExtraction(email):
     emailFiles = []
     emailFiles.extend(email)
     featureMatrix = np.zeros((len(emailFiles), 3000))


def classifyLingspamDataset(email):
    classifications = []
    for name in email:
         classifications.append("spmsg" in name)
    return classifications

# Lingspam dataset
trainingDataLingspam, trainingLingspamFilename = importLingspamDataset("spam-non-spam-dataset/train-mails") # extract the training emails from the dataset
#testingDataLingspam, testingLingspamFilename = importLingspamDataset("spam-non-spam-dataset/test-mails") # extract the testing emails from the dataset

trainingDataLingspamClean = cleanAllEmails(trainingDataLingspam)
#testingDataLingspamClean = cleanAllEmails(testingDataLingspam)

#trainClassifyLingspam = classifyLingspamDataset(trainingDataLingspam)
#testClassifyLingspam = classifyLingspamDataset(testingDataLingspam)

trainDictionary = createDictionary(trainingDataLingspamClean)
#createDictionary(testingDataLingspamClean)

#trainingDataEnron, trainingEnronFilename = importEnronDataset("spam-non-spam-dataset-enron/bigEmailDump/training/")
  • `return allWords` will return from the function on the very first iteration of the loop. This means that `allWords` will contain one list with the words of the first line – ForceBru Jun 18 '20 at 11:04
  • @hle00001 so essentially you are splitting a line of words into a list. There are scores of solutions for removing what you have asked about. Have you seen these similar Qs, [1](https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string), [2](https://stackoverflow.com/questions/4371231/removing-punctuation-from-python-list-items), [3](https://stackoverflow.com/questions/40916263/removing-punctuation-from-a-list-in-python)? If yes and still no go, then do update your Q to reflect the same. – mnm Jun 18 '20 at 11:08
  • [this](https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string) can be helpful – Equinox Jun 18 '20 at 11:08
  • The translate approach doesn't work for me, as I get an error later on saying that it is not a hashable list type. I can post the rest of my code if that is more helpful. I need to get all lines from line 2 onwards, and I have multiple txt files. Using `x = [''.join(c for c in s if c not in string.punctuation)]` as suggested, I think the punctuation is removed; however, I am not getting any output for my word cloud now, and I am only getting one line, as previously mentioned –  Jun 18 '20 at 11:22

1 Answer

Based on your question, I assume that you have a list of emails and would like to remove the punctuation marks from each email. This answer is based on the first revision of the code you posted.

import string


def removePunctuation(emails):

    # I am using a list comprehension here to iterate over the emails.
    # For each iteration, translate the email to remove the punctuation marks.
    # Translate only allows a translation table as an argument.
    # This is why str.maketrans is used to create the translation table.

    cleaned_emails = [email.translate(str.maketrans('', '', string.punctuation))
                      for email in emails]

    return cleaned_emails


if __name__ == '__main__':

    # Assuming cleanedEmails in the question is a list of emails,
    # I renamed that input to emails here
    # and used cleaned_emails for the result.

    emails = ["This is a, test!", "This is another#@! \ntest"]
    cleaned_emails = removePunctuation(emails)
    print(cleaned_emails)
input: ["This is a, test!", "This is another#@! \ntest"]
output: ['This is a test', 'This is another \ntest']
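If the later steps need individual words rather than whole emails, a word-level variant could look like the sketch below. It is only an illustration under assumptions: `remove_punctuation_from_words` and `_PUNCT_TABLE` are hypothetical names that do not appear in the question, and splitting on whitespace is assumed to be good enough for these emails. The result list is created before the loop and returned after it, so every email contributes its words.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Built once and reused for every word.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def remove_punctuation_from_words(emails):
    """Return one flat list of words with punctuation stripped (hypothetical helper)."""
    all_words = []  # created before the loop so it accumulates across all emails
    for email in emails:
        for word in email.split():  # split() breaks on any whitespace, including newlines
            cleaned = word.translate(_PUNCT_TABLE)
            if cleaned:  # skip tokens that consisted only of punctuation
                all_words.append(cleaned)
    return all_words  # returned after the loop, not inside it


if __name__ == '__main__':
    emails = ["This is a, test!", "This is another#@! \ntest"]
    print(remove_punctuation_from_words(emails))
```

A flat list of words like this can be handed straight to `Counter`, which is what the word-frequency step in the question expects.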

EDIT:

The issue was resolved after a conversation with the OP. The OP was having an issue with WordCloud, and the punctuation-removal solution above is working. I managed to guide the OP through getting WordCloud working, and the OP is now fine-tuning the WordCloud results.
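The exact changes made in chat are not recorded here, so the following is only a sketch of how word frequencies for a word cloud could be prepared. Two things in the question's `createDictionary` are worth noting: the result of `dictionary.most_common(3000)` is discarded, and the function returns nothing, so `trainDictionary` ends up as `None`. The helper name `build_frequencies` and the lowercasing step are assumptions of mine, not part of the original code.

```python
from collections import Counter


def build_frequencies(words, max_words=3000):
    """Map the most frequent words to their counts (hypothetical helper)."""
    # Lowercasing is an assumption so that "This" and "this" are counted together.
    counts = Counter(word.lower() for word in words)
    # most_common returns a new list of (word, count) pairs and leaves the
    # Counter unchanged, so its result has to be kept and returned.
    return dict(counts.most_common(max_words))


if __name__ == '__main__':
    words = ['This', 'is', 'a', 'test', 'This', 'is', 'another', 'test']
    frequencies = build_frequencies(words)
    print(frequencies)
    # With the wordcloud package installed, this dict can be passed to
    # WordCloud(...).generate_from_frequencies(frequencies).
```

Passing a dict of individual words, rather than whole emails as single keys, gives the word cloud short tokens to place.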

Raymond C.
  • Currently trying to run the new code. I have updated my code above, so I'm not sure whether this will still work the same –  Jun 18 '20 at 11:40
  • If you are passing in a list of emails, it should work. You just need to copy the function removePunctuation. The bottom part of the code is for testing. – Raymond C. Jun 18 '20 at 11:41
  • I have copied the function, but it seems to be taking a long while to do anything, so I'm not 100% sure whether it is working correctly. Previously I was able to remove the punctuation as seen below: `def removePunctuation(cleanedEmails): punc = set(string.punctuation) punctuationToRemove = "".join([i for i in cleanedEmails if i not in punc]) numbersToRemove = "".join([i for i in punctuationToRemove if i.isalpha() and len(i) == 1]) return numbersToRemove` –  Jun 18 '20 at 11:58
  • However, the issue with this was that I was not getting whole words; I was getting single characters –  Jun 18 '20 at 11:59
  • I have just got an error message back: `ValueError: Couldn't find space to draw. Either the Canvas size is too small or too much of the image is masked out.` I did increase the size previously, but I still got the same message –  Jun 18 '20 at 12:00
  • Ok, I am a bit confused here. So what error did you receive when you ran my function? – Raymond C. Jun 18 '20 at 12:09
  • Apologies, I got the ValueError when running your code, the one about either the canvas size being too small or too much of the image being masked out –  Jun 18 '20 at 12:11
  • Do the emails have images inside them? – Raymond C. Jun 18 '20 at 12:19
  • Not to my knowledge, no. I'm not sure whether it would be easier to rectify what I had previously? –  Jun 18 '20 at 12:20
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/216208/discussion-between-raymond-c-and-hle00001). – Raymond C. Jun 18 '20 at 12:24