I am practicing using NLTK to remove certain features from raw tweets, and I subsequently hope to drop tweets that are (to me) irrelevant (e.g. empty or single-word tweets). However, it seems that some of the single-word tweets are not removed. I am also facing an issue where stopwords at the beginning or end of a tweet are never removed.
Any advice? At the moment, I want to pass back a sentence as the output rather than a list of tokenized words.
Any other comments on improving the code (processing time, elegance) are welcome.
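For context, the per-tweet output I am after would look roughly like this token-level pass (a minimal sketch with made-up names, separate from the code under review below): tokenize, drop stopwords wherever they occur, then join the survivors back into one string.

    import nltk
    from nltk.corpus import stopwords

    english_stopwords = set(stopwords.words('english'))  # set membership is O(1)

    def drop_stopwords(tweet):
        # Tokenize, filter stopwords regardless of their position, rebuild a sentence.
        tokens = nltk.word_tokenize(tweet)
        kept = [w for w in tokens if w.lower() not in english_stopwords]
        return ' '.join(kept)

Because this filters the token list instead of regex-substituting over the raw string, a stopword at the beginning or end of the tweet is treated like any other token. The full function I currently have is below.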
import string
import numpy as np
import nltk
from nltk.corpus import stopwords
cache_english_stopwords = stopwords.words('english')
# 'english_tweet' is a custom stopword file I placed in the NLTK stopwords corpus directory
cache_en_tweet_stopwords = stopwords.words('english_tweet')
# For clarity: df is a pandas DataFrame with a 'text' column among other columns.
def tweet_clean(df):
    temp_df = df.copy()
    # Remove hyperlinks
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('https?:\/\/.*\/\w*', '', regex=True)
    # Remove hashtags (the commented-out version drops the whole tag; the active one keeps the word)
    # temp_df.loc[:,"text"]=temp_df.loc[:,"text"].replace('#\w*', '', regex=True)
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('#', ' ', regex=True)
    # Remove citations
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\@\w*', '', regex=True)
    # Remove tickers
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\$\w*', '', regex=True)
    # Remove punctuation
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('[' + string.punctuation + ']+', '', regex=True)
    # Remove stopwords
    for tweet in temp_df.loc[:, "text"]:
        tweet_tokenized = nltk.word_tokenize(tweet)
        for w in tweet_tokenized:
            if (w.lower() in cache_english_stopwords) | (w.lower() in cache_en_tweet_stopwords):
                temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('[\W*\s?\n?]' + w + '[\W*\s?]', ' ', regex=True)
                # print("w in stopword")
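                # NOTE: the pattern above requires a character on both sides of w,
                # which I suspect is why stopwords at the very start or end of a
                # tweet are never matched.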
    # Remove HTML entity residue such as &amp; and &gt;
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\&*[amp]*\;|gt+', '', regex=True)
    # Remove RT (retweet marker)
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\s+rt\s+', '', regex=True)
    # Remove linebreaks, tabs, carriage returns
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('[\n\t\r]+', ' ', regex=True)
    # Remove 'via'
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('via+\s', '', regex=True)
    # Collapse runs of whitespace into a single space
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\s+\s+', ' ', regex=True)
    # Remove single-word tweets
    for tweet_sw in temp_df.loc[:, "text"]:
        tweet_sw_tokenized = nltk.word_tokenize(tweet_sw)
        if len(tweet_sw_tokenized) <= 1:
            # meant to blank out the offending tweet
            temp_df.loc["text"] = np.nan
    # Remove empty rows
    temp_df.loc[(temp_df["text"] == '') | (temp_df['text'] == ' ')] = np.nan
    temp_df = temp_df.dropna()
    return temp_df
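For reference, this is how I exercise tweet_clean on a toy frame (the sample tweets below are invented, and the custom english_tweet stopword file is assumed to be in place):

    import pandas as pd

    sample = pd.DataFrame({'text': ['RT @someone: see https://t.co/abc123 #nltk is great',
                                    'hello',
                                    '']})
    print(tweet_clean(sample))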