
I have a large dataframe (~500,000 observations) of structured Twitter data (i.e. username, retweet counts, text) in RStudio. I want to run a text analysis on the tweets so I can extract the observations that have one or more keywords in the tweet text.

I have stored my keywords as `keywords_C <- c("climate change","climate","climatechange","global warming","globalwarming")`. Tweet text is stored in my dataframe in a column labelled `text`.

How do I make a new dataframe containing only observations where one or more of the keywords are present in the text column? Alternatively, can I delete observations where the keywords are not present?


My dataframe is called NewCData

dput(droplevels(head(NewCData, 10)))

  structure(list(timestamp = structure(c(1L, 3L, 2L, 6L, 4L, 4L, 
    5L, 8L, 7L, 9L), .Label = c("2015-10-30 21:37:58", "2015-10-30 21:38:02", 
    "2015-10-30 21:38:03", "2015-10-30 21:38:06", "2015-10-30 21:38:07", 
    "2015-10-30 21:38:10", "2015-10-30 21:38:14", "2015-10-30 21:38:32", 
    "2015-10-30 21:39:04"), class = "factor"), id_str = structure(c(1L, 
    3L, 2L, 7L, 4L, 5L, 6L, 9L, 8L, 10L), .Label = c("660209050429186048", 
    "660209067584016384", "660209072768212992", "660209083505504256", 
    "660209086143688704", "660209087628578816", "660209102790914048", 
    "660209119152893952", "660209195162206208", "660209325986549760"
    ), class = "factor"), user.id_str = structure(c(1L, 3L, 8L, 5L, 
    5L, 2L, 4L, 6L, 9L, 7L), .Label = c("277335277", "32380087", 
    "325105950", "33398863", "68956490", "808114195", "87712431", 
    "90280824", "949996219"), class = "factor"), user.followers_count = structure(c(7L, 
    2L, 8L, 4L, 4L, 3L, 6L, 9L, 5L, 1L), .Label = c("10212", "1062", 
    "1389", "15227", "2214", "2851", "38", "4137", "55"), class = "factor"), 
        ideology = structure(c(2L, 4L, 3L, 9L, 9L, 5L, 8L, 6L, 1L, 
        7L), .Label = c("-0.309303177803536", "-0.393703659798908", 
        "-0.795976086971656", "-0.811321629152632", "-0.946143178314071", 
        "-1.16317298915931", "0.353843466445817", "1.09919837237897", 
        "2.29286233202781"), class = "factor"), text = structure(c(2L, 
        9L, 4L, 1L, 3L, 10L, 5L, 7L, 6L, 8L), .Label = c("Better Dead than Red! Bill Gates says that only socialism can save us ", 
        "Expert briefing on  #disarmament #SDGs @NMUN ", 
        "I see red people Bill Gates says that only socialism can save us from climate change ", 
        "RT: Oddly enough, some Republicans think climate change is real: Oddly enough,…  #UniteBlue ", 
        "Ted Cruz: ‘Climate change is not science, it’s religion’  via @glennbeck", 
        "This is an amusing headline: \"Bill Gates says that only socialism can save us from climate change\"", 
        "Unusual Weather Kills Gulf of Maine Cod : Discovery News #globalwarming  ", 
        "What do the remaining Republican candidates have to say about climate change? #FixGov", 
        "Who Uses #NASA Earth Science Data? He looks at impact of #aerosols on #climate #weather!", 
        "Why go for ecosystem basses conservation! #ClimateChange #Raajje #Maldives"
        ), class = "factor")), .Names = c("timestamp", "id_str", 
    "user.id_str", "user.followers_count", "ideology", "text"), row.names = c(NA, 
    10L), class = "data.frame")
user72716
  • Can you please share a reproducible example? Use `dput(head(twitterData,10))` and add the result to the question. Or `dput(droplevels(head(twitterData, 10)))` if your data frame has a factor with many levels. See [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Wiktor Stribiżew Nov 08 '18 at 07:47
  • My apologies, but I'm not sure I understand your request. Perhaps you could elaborate on what information you need from me? – user72716 Nov 08 '18 at 09:48
  • See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example. In order to help you a portion of your input data is necessary together with the expected result. – Wiktor Stribiżew Nov 08 '18 at 09:48
  • I couldn't get an understandable output from the `dput` function you suggested (sorry, I'm quite inexperienced), but I added more info in my question. Please let me know if this makes sense. – user72716 Nov 08 '18 at 11:13
  • It is not usable. Add the output you got from `dput` as is. You do not need to understand it. – Wiktor Stribiżew Nov 08 '18 at 11:14
  • Added `dput` output – user72716 Nov 08 '18 at 11:33
  • Try `new_df <- NewCData[with(NewCData, grepl(paste0("\\b(?:",paste(keywords_C, collapse="|"),")\\b"), text)),]` – Wiktor Stribiżew Nov 08 '18 at 11:41
  • Amazing! Thank you for bearing with me Wiktor! – user72716 Nov 08 '18 at 11:49

1 Answer

You may use

new_df <- NewCData[with(NewCData, grepl(paste0("\\b(?:",paste(keywords_C, collapse="|"),")\\b"), text)),]

See the R demo online

The point here is to combine the keywords into a pattern like

\b(?:climate change|climate|climatechange|global warming|globalwarming)\b

The `\b` word boundaries on both sides make the keywords match only as whole words. If any keyword matches in the text column, the row is kept; otherwise, the row is dropped.
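To see the filter in action, here is a minimal sketch on a toy data frame. The `toy` rows are hypothetical stand-ins modelled on the sample in the question, not the real `NewCData`, and the `ignore.case = TRUE` variant is an addition that is not part of the call above. Note that `grepl` is case-sensitive by default, so text such as `#ClimateChange` is only caught by the second call.

```r
# Keywords from the question.
keywords_C <- c("climate change", "climate", "climatechange",
                "global warming", "globalwarming")

# Hypothetical toy data standing in for NewCData.
toy <- data.frame(
  text = c("Expert briefing on #disarmament #SDGs @NMUN",
           "Bill Gates says that only socialism can save us from climate change",
           "Unusual Weather Kills Gulf of Maine Cod #globalwarming",
           "Ecosystem based conservation! #ClimateChange #Maldives",
           "Acclimated to the cold"),
  stringsAsFactors = FALSE
)

# Same pattern construction as in the answer: keywords joined with "|"
# inside a non-capturing group, wrapped in word boundaries.
pattern <- paste0("\\b(?:", paste(keywords_C, collapse = "|"), ")\\b")

# Case-sensitive filter, as in the answer.
case_sensitive <- toy[grepl(pattern, toy$text), , drop = FALSE]

# Assumed variant: also match "Climate", "ClimateChange", and so on.
case_insensitive <- toy[grepl(pattern, toy$text, ignore.case = TRUE), , drop = FALSE]
```

In this sketch the "Acclimated" row is never kept, because "climate" sits inside a longer word there and the word boundaries prevent a match. Whether to use `ignore.case = TRUE` on the real data depends on whether capitalised keywords and hashtags should count as hits.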

Wiktor Stribiżew