
I extracted tweets from Twitter using the twitteR package and saved them to a text file.

I have carried out the following on the corpus:

xx <- tm_map(xx, removeNumbers, lazy=TRUE, mc.cores=1)
xx <- tm_map(xx, stripWhitespace, lazy=TRUE, mc.cores=1)
xx <- tm_map(xx, removePunctuation, lazy=TRUE, mc.cores=1)
xx <- tm_map(xx, strip_retweets, lazy=TRUE, mc.cores=1)
xx <- tm_map(xx, removeWords, stopwords("english"), lazy=TRUE, mc.cores=1)

(using mc.cores=1 and lazy=TRUE, as otherwise R on Mac runs into errors)

tdm<-TermDocumentMatrix(xx)

But this term document matrix has a lot of strange symbols, meaningless words and the like. If a tweet is

 RT @Foxtel: One man stands between us and annihilation: @IanZiering.
 Sharknado 3: OH HELL NO! - July 23 on Foxtel @SyfyAU

After cleaning the tweet, I want only proper, complete English words to be left, i.e. a sentence/phrase devoid of everything else (user names, shortened words, URLs).

example:

One man stands between us and annihilation oh hell no on 

(Note: the transformation commands in the tm package can only remove stop words, punctuation, and whitespace, and convert to lowercase.)

kRazzy R
  • then, `sharknado` and `foxtel` would be done, since they're not "proper" english words... – Marc B Jul 10 '15 at 19:06
  • Do you see any improvement if you use, e.g., `xx <- tm_map(xx, content_transformer(removePunctuation))` or `xx <- tm_map(xx, content_transformer(tolower))`? – RHertel Jul 10 '15 at 19:14
  • The precise syntax may depend on the version number of the `tm` package that you have installed. – RHertel Jul 10 '15 at 19:15

5 Answers


Using gsub and the stringr package

I have figured out part of the solution: removing retweets, references to screen names, hashtags, spaces, numbers, punctuation, and URLs.

  clean_tweet = gsub("&amp;", "", unclean_tweet)                      # remove the HTML entity for "&"
  clean_tweet = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", clean_tweet)  # remove retweet/via headers
  clean_tweet = gsub("@\\w+", "", clean_tweet)                        # remove @mentions
  clean_tweet = gsub("[[:punct:]]", "", clean_tweet)                  # remove punctuation
  clean_tweet = gsub("[[:digit:]]", "", clean_tweet)                  # remove digits
  clean_tweet = gsub("http\\w+", "", clean_tweet)                     # remove URLs (punctuation is already gone, so they look like "httptco...")
  clean_tweet = gsub("[ \t]{2,}", " ", clean_tweet)                   # collapse runs of spaces/tabs into one space
  clean_tweet = gsub("^\\s+|\\s+$", "", clean_tweet)                  # trim leading/trailing whitespace

ref: (Hicks, 2014). After the above, I did the below.

 #get rid of unnecessary spaces
clean_tweet <- str_replace_all(clean_tweet, " {2,}", " ")
# Get rid of URLs
clean_tweet <- str_replace_all(clean_tweet, "http://t.co/[a-z,A-Z,0-9]*{8}","")
# Take out retweet header, there is only one
clean_tweet <- str_replace(clean_tweet,"RT @[a-z,A-Z]*: ","")
# Get rid of hashtags
clean_tweet <- str_replace_all(clean_tweet,"#[a-z,A-Z]*","")
# Get rid of references to other screennames
clean_tweet <- str_replace_all(clean_tweet,"@[a-z,A-Z]*","")   

ref: (Stanton, 2013)

Before doing any of the above, I collapsed the whole vector of tweets into a single long character string using the line below.

paste(mytweets, collapse=" ")

This cleaning process has worked quite well for me, as opposed to the tm_map transforms.

All that I am left with now is a set of proper words and a very few improper ones. Now I only have to figure out how to remove the non-proper English words. Probably I will have to subtract my set of words from a dictionary of words.
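A minimal sketch of that dictionary-subtraction idea, using a toy word list (`dictionary` and `keep_dictionary_words` are illustrative names; in practice you would use a full word list, e.g. `GradyAugmented` from the qdapDictionaries package):

```r
# Toy dictionary; a real one would have on the order of 100k entries.
dictionary <- c("one", "man", "stands", "between", "us", "and",
                "annihilation", "oh", "hell", "no", "on")

# Split the cleaned text into words and keep only those found in the dictionary.
keep_dictionary_words <- function(text, dict) {
  words <- unlist(strsplit(tolower(text), "\\s+"))
  paste(words[words %in% dict], collapse = " ")
}

keep_dictionary_words("One man stands between us and annihilation Sharknado oh hell no",
                      dictionary)
# "one man stands between us and annihilation oh hell no"
```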

kRazzy R
  • 1,561
  • 1
  • 16
  • 44
  • This works great, but make sure you don't use `clean_tweet` in your argument, if you don't want to overwrite the variable! – timothyjgraham Mar 10 '16 at 06:57
  • Also make sure the order is correct. If you first remove the mentions and then do the RT check (`clean_tweet <- str_replace(clean_tweet,"RT @[a-z,A-Z]*: ","")`), it won't find anything, because the `@` isn't there anymore. – Mathias711 Sep 15 '16 at 14:10

    library(tidyverse)

    clean_tweets <- function(x) {
      x %>%
        # Remove URLs
        str_remove_all(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>%
        # Remove mentions e.g. "@my_account"
        str_remove_all("@[[:alnum:]_]{4,}") %>%
        # Remove hashtags
        str_remove_all("#[[:alnum:]_]+") %>%
        # Replace "&" character reference with "and"
        str_replace_all("&amp;", "and") %>%
        # Remove punctuation, using a standard character class
        str_remove_all("[[:punct:]]") %>%
        # Remove "RT: " from beginning of retweets
        str_remove_all("^RT:? ") %>%
        # Replace any newline characters with a space
        str_replace_all("\\\n", " ") %>%
        # Make everything lowercase
        str_to_lower() %>%
        # Remove any trailing whitespace around the text
        str_trim("both")
    }

    tweets %>% clean_tweets()
RDRR
  • Would it be possible to get comments on what is being removed in each step? I am currently learning about regex but still have issues identifying some expressions. Thanks – k3r0 Jun 04 '21 at 06:11
  • @k3r0 - I've added comments to each step to clarify what each one is doing – RDRR Jun 05 '21 at 11:05
  • I read a bit on it and figured out some of them, but not all. Wasn't really familiar with executing functions, so that was a good learn too. Thanks for that! – k3r0 Jun 05 '21 at 11:15

To remove the URLs you could try the following:

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
xx <- tm_map(xx, removeURL)

Possibly you could define similar functions to further transform the text.
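For instance, a similar transformer could strip the stray non-ASCII symbols mentioned in the question. A sketch under the same pattern (`removeNonASCII` is just an illustrative name; wrapping it in `content_transformer()` keeps it compatible with newer versions of tm):

```r
library(tm)

# Drop every character outside the printable ASCII range.
removeNonASCII <- function(x) gsub("[^\x20-\x7E]", "", x)

xx <- VCorpus(VectorSource("Sharknado\u200b 3: OH HELL NO!"))
xx <- tm_map(xx, content_transformer(removeNonASCII))
content(xx[[1]])
# "Sharknado 3: OH HELL NO!"
```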

RHertel

For me, this code did not work, for some reason:

# Get rid of URLs
clean_tweet <- str_replace_all(clean_tweet, "http://t.co/[a-z,A-Z,0-9]*{8}","")

The error was:

Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement),  : 
 Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

The `*{8}` part applies the `{8}` quantifier to a pattern that already ends in `*`, which the ICU regex engine behind stringr rejects. So, instead, I used

clean_tweet4 <- str_replace_all(clean_tweet3, "https://t.co/[a-z,A-Z,0-9]*","")
clean_tweet5 <- str_replace_all(clean_tweet4, "http://t.co/[a-z,A-Z,0-9]*","")

to get rid of URLs

Cur123

The following code does some basic cleaning.

Convert to lowercase:

df <- tm_map(df, content_transformer(tolower))

Remove punctuation:

df <- tm_map(df, removePunctuation)

Remove numbers:

df <- tm_map(df, removeNumbers)

Remove common stop words:

df <- tm_map(df, removeWords, stopwords('english'))

Remove URLs:

removeURL <- function(x) gsub('http[[:alnum:]]*', '', x)
df <- tm_map(df, content_transformer(removeURL))
kRazzy R