
I extracted tweets from Twitter using the twitteR package and saved them to a text file.

I have carried out the following on the corpus:

xx <- tm_map(xx, removeNumbers, lazy=TRUE, mc.cores=1)
xx <- tm_map(xx, stripWhitespace, lazy=TRUE, mc.cores=1)
xx <- tm_map(xx, removePunctuation, lazy=TRUE, mc.cores=1)
xx <- tm_map(xx, strip_retweets, lazy=TRUE, mc.cores=1)
xx <- tm_map(xx, removeWords, stopwords("english"), lazy=TRUE, mc.cores=1)

(using mc.cores=1 and lazy=TRUE, as otherwise R on Mac runs into errors)

tdm<-TermDocumentMatrix(xx)

But this term document matrix has a lot of strange symbols, meaningless words and the like. If a tweet is

 RT @Foxtel: One man stands between us and annihilation: @IanZiering.
 Sharknado 3: OH HELL NO! - July 23 on Foxtel @SyfyAU

After cleaning the tweet, I want only proper, complete English words to be left, i.e. a sentence/phrase devoid of everything else (user names, shortened words, URLs).

example:

One man stands between us and annihilation oh hell no on 

(Note: the transformation commands in the tm package can only remove stop words, punctuation, and whitespace, and convert to lowercase.)

kRazzy R
  • then, `sharknado` and `foxtel` would be done, since they're not "proper" english words... – Marc B Jul 10 '15 at 19:06
  • Do you see any improvement if you use, e.g., `xx <- tm_map(xx, content_transformer(removePunctuation))` or `xx <- tm_map(xx, content_transformer(tolower))`? – RHertel Jul 10 '15 at 19:14
  • The precise syntax may depend on the version number of the `tm` package that you have installed. – RHertel Jul 10 '15 at 19:15

5 Answers


Using gsub and the stringr package

I have figured out part of the solution: removing retweets, references to screen names, hashtags, spaces, numbers, punctuation, and URLs.

  clean_tweet = gsub("&amp;", "", unclean_tweet)                      # remove the HTML entity for "&"
  clean_tweet = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", clean_tweet)  # remove retweet/via headers
  clean_tweet = gsub("@\\w+", "", clean_tweet)                        # remove @mentions
  clean_tweet = gsub("[[:punct:]]", "", clean_tweet)                  # remove punctuation
  clean_tweet = gsub("[[:digit:]]", "", clean_tweet)                  # remove digits
  clean_tweet = gsub("http\\w+", "", clean_tweet)                     # remove URLs (punctuation is already gone, so they look like "httptco...")
  clean_tweet = gsub("[ \t]{2,}", " ", clean_tweet)                   # collapse runs of spaces/tabs into one space
  clean_tweet = gsub("^\\s+|\\s+$", "", clean_tweet)                  # trim leading/trailing whitespace

ref: (Hicks, 2014). After the above, I did the below.

 #get rid of unnecessary spaces
clean_tweet <- str_replace_all(clean_tweet, " {2,}", " ")
# Get rid of URLs
clean_tweet <- str_replace_all(clean_tweet, "http://t.co/[a-z,A-Z,0-9]*{8}","")
# Take out retweet header, there is only one
clean_tweet <- str_replace(clean_tweet,"RT @[a-z,A-Z]*: ","")
# Get rid of hashtags
clean_tweet <- str_replace_all(clean_tweet,"#[a-z,A-Z]*","")
# Get rid of references to other screennames
clean_tweet <- str_replace_all(clean_tweet,"@[a-z,A-Z]*","")   

ref: (Stanton, 2013)

Before doing any of the above, I collapsed the whole vector of tweets into a single long character string using the line below.

paste(mytweets, collapse=" ")

This cleaning process has worked quite well for me, as opposed to the tm_map transforms.

All that I am left with now is a set of proper words and a very few improper ones. Now I only have to figure out how to remove the non-proper English words. Probably I will have to subtract my set of words from a dictionary of words.
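A minimal sketch of that dictionary-subtraction idea, using a toy word list (`dictionary` and `keep_dictionary_words` are illustrative names; in practice you would use a full word list, e.g. `GradyAugmented` from the qdapDictionaries package):

```r
# Toy dictionary; a real one would have on the order of 100k entries.
dictionary <- c("one", "man", "stands", "between", "us", "and",
                "annihilation", "oh", "hell", "no", "on")

# Split the cleaned text into words and keep only those found in the dictionary.
keep_dictionary_words <- function(text, dict) {
  words <- unlist(strsplit(tolower(text), "\\s+"))
  paste(words[words %in% dict], collapse = " ")
}

keep_dictionary_words("One man stands between us and annihilation Sharknado oh hell no",
                      dictionary)
# "one man stands between us and annihilation oh hell no"
```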

kRazzy R
  • 1,561
  • 1
  • 16
  • 44
  • This works great, but make sure you don't use `clean_tweet` in your argument, if you don't want to overwrite the variable! – timothyjgraham Mar 10 '16 at 06:57
  • Also make sure the order is correct. If you first remove the mentions and then do the RT check (`clean_tweet <- str_replace(clean_tweet,"RT @[a-z,A-Z]*: ","")`), it won't find anything, because the `@` isn't there anymore. – Mathias711 Sep 15 '16 at 14:10

    library(tidyverse)

    clean_tweets <- function(x) {
      x %>%
        # Remove URLs
        str_remove_all(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>%
        # Remove mentions e.g. "@my_account"
        str_remove_all("@[[:alnum:]_]{4,}") %>%
        # Remove hashtags
        str_remove_all("#[[:alnum:]_]+") %>%
        # Replace "&" character reference with "and"
        str_replace_all("&amp;", "and") %>%
        # Remove punctuation, using a standard character class
        str_remove_all("[[:punct:]]") %>%
        # Remove "RT: " from beginning of retweets
        str_remove_all("^RT:? ") %>%
        # Replace any newline characters with a space
        str_replace_all("\\\n", " ") %>%
        # Make everything lowercase
        str_to_lower() %>%
        # Remove any trailing whitespace around the text
        str_trim("both")
    }

    tweets %>% clean_tweets()
RDRR
  • Would it be possible to get comments on what is being removed in each step? I am currently learning about regex but still have issues identifying some expressions. Thanks – k3r0 Jun 04 '21 at 06:11
  • @k3r0 - I've added comments to each step to clarify what each one is doing – RDRR Jun 05 '21 at 11:05
  • I read a bit on it and figured out some of them, but not all. Wasn't really familiar with executing functions, so that was a good learn too. Thanks for that! – k3r0 Jun 05 '21 at 11:15

To remove the URLs you could try the following:

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
xx <- tm_map(xx, removeURL)

Possibly you could define similar functions to further transform the text.
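For instance, a similar transformer could strip the stray non-ASCII symbols mentioned in the question. A sketch under the same pattern (`removeNonASCII` is just an illustrative name; wrapping it in `content_transformer()` keeps it compatible with newer versions of tm):

```r
library(tm)

# Drop every character outside the printable ASCII range.
removeNonASCII <- function(x) gsub("[^\x20-\x7E]", "", x)

xx <- VCorpus(VectorSource("Sharknado\u200b 3: OH HELL NO!"))
xx <- tm_map(xx, content_transformer(removeNonASCII))
content(xx[[1]])
# "Sharknado 3: OH HELL NO!"
```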

RHertel

For me, this code did not work, for some reason:

# Get rid of URLs
clean_tweet <- str_replace_all(clean_tweet, "http://t.co/[a-z,A-Z,0-9]*{8}","")

The error was:

Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement),  : 
 Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

The `*{8}` part applies the `{8}` quantifier to a pattern that already ends in `*`, which the ICU regex engine behind stringr rejects. So, instead, I used

clean_tweet4 <- str_replace_all(clean_tweet3, "https://t.co/[a-z,A-Z,0-9]*","")
clean_tweet5 <- str_replace_all(clean_tweet4, "http://t.co/[a-z,A-Z,0-9]*","")

to get rid of URLs

Cur123

The following code does some basic cleaning.

Convert to lowercase:

df <- tm_map(df, content_transformer(tolower))

Remove punctuation:

df <- tm_map(df, removePunctuation)

Remove numbers:

df <- tm_map(df, removeNumbers)

Remove common stop words:

df <- tm_map(df, removeWords, stopwords('english'))

Remove URLs:

removeURL <- function(x) gsub('http[[:alnum:]]*', '', x)
df <- tm_map(df, content_transformer(removeURL))
kRazzy R