I have a large dataframe in R consisting of Twitter data (time of posting, user ID number, tweet text, etc.). I want to collapse all observations with identical tweet text (the `text` column) into a single observation while counting how many times that message appears in the dataframe. In other words, if the message "The cat is in the tree" appears in the dataframe 12 times, I want to create a dataframe in which only the first posting appears, with a column showing 12 next to the message.

How might I do this?


Here is my reproducible data:

`structure(list(timestamp = structure(c(1446241090, 1446241086, 
1446241094, 1446241107, 1446241158, 1446241132, 1446241181, 1446241202, 
1446241209, 1446241304, 1446241318, 1446241327, 1446241297, 1446241345, 
1446241530, 1446241382, 1446241624, 1446241577, 1446241707, 1446241583, 
1446241739, 1446241739, 1446241602, 1446241682, 1446241687, 1446241773, 
1446241703, 1446241664, 1446241842, 1446241696), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), id_str = c(660209102790914048, 660209083505504256, 
660209119152893952, 660209170730258432, 660209385713573888, 660209278498824192, 
660209483080056832, 660209569935835136, 660209601502162944, 660209999600332800, 
660210059658432512, 660210094819311616, 660209967971115008, 660210170643922944, 
660210945285746688, 660210326596681728, 660211339168714752, 660211145849053184, 
660211690164744192, 660211169035161600, 660211824374231040, 660211825049497600, 
660211250324992000, 660211583772045312, 660211603237834752, 660211966766620672, 
660211670568988672, 660211508937363456, 660212253707358208, 660211641858916352
), user.id_str = c(68956490, 68956490, 949996219, 68956490, 1665986042, 
529591144, 20809182, 135909586, 20118515, 2327500422, 2382485564, 
1881559508, 2403408967, 949996219, 124533535, 14545416, 347334263, 
711042272, 68956490, 152240878, 1723563360, 1723563360, 135909586, 
68956490, 68956490, 419665502, 68956490, 17374940, 112219846, 
68956490), user.followers_count = c(15227, 15227, 2214, 15227, 
756, 3608, 1121, 721, 13484, 321, 188, 886, 1446, 2214, 1076, 
2310, 1754, 995, 15228, 1269, 7983, 7983, 721, 15228, 15228, 
2075, 15228, 955, 635, 15228), ideology = c(2.29286233202781, 
2.29286233202781, -0.309303177803536, 2.29286233202781, -0.778438324479111, 
2.16242522348951, -0.908875433017413, -0.699518393262659, 1.62423513699954, 
0.417417855481292, 1.12769723642936, 0.600468251497229, -0.907779322861629, 
-0.309303177803536, -0.59977236908631, 1.54860353625044, 1.76234501662833, 
-0.0111612154302728, 2.29286233202781, 0.112699232173325, -0.306014847336183, 
-0.306014847336183, -0.699518393262659, 2.29286233202781, 2.29286233202781, 
-0.749939460428726, 2.29286233202781, -0.83214772211253, -0.863934916630267, 
2.29286233202781), text = c("better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok", "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok", "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc", "communist core bill gates says that only socialism can save us from climate change httpstcogqm7k64f0r", "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc", "better dead than red bill gates says that only socialism can save us from climate change httpstcopypq", "lights camera climate change action showing of the carbon negative good allpowerlabs httpstcobpxdhjqw5i", "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc", "whether climate change is real or not we must all do our part to care for mother earth it is our gift from god httpstcotpjerkfu5u", "soot no doubt volcanic in origin greenland ice melt due to global warming found not so bad  httpstcoyqsmd6d4sm via", "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc", 
"the arctics got a serious chemtrails problem looks pretty bad again today\r\n\r\nclimatechange geoengineering httpstco1ls", "bill gates says that only socialism can save us from climate change httpstcoe5psgltj59 httpstcop9oye6sipx", "naomiaklein looks like the headline changed already bill gates says that capitalism cannot save us from climate chang", "bill gates says that only socialism can save us from climate change httpstcovjvklrncwq httpstcodtnjg7e0rz", "not to ruin halloween or anything but obama wants to take a moment to remind you that your jack olantern is causing globalwarming", 
"save on green hosting from hostgator use 25 off coupon code get25offhg httpstcohlk3yp1eew webhost webhosting climatechange", "bill gates says that only socialism can save us from climate change httpstcosspebdd3m9 httpstcovttupglukt", "bring back huac bill gates says that only socialism can save us from climate change httpstcospnurbgevy", "bill gates says that only socialism can save us from climate change httpstcovjvklrncwq httpstcodtnjg7e0rz", "what happens when you ingore climate good science amp management plans fail exhibit a atlantic cod science https", "what happens when you ingore climate good science amp management plans fail exhibit a atlantic cod science https", "naomiaklein looks like the headline changed already bill gates says that capitalism cannot save us from climate chang", "party like 1989 bill gates says that only socialism can save us from climate change httpstcopypqmd1fok", "dawn of the red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok", "bill gates says that only socialism can save us from climate change httpstco8khrx6cgmd", "light up the tilt bill gates says that only socialism can save us from climate change httpstcopypqmd1fok", "annegalloway good  but maybe impossible to answer wout knowing the context ie war climate change other nonhumans that may be harmed", "bill gates says that only socialism can save us from climate change httpstcoe5psgltj59 httpstcop9oye6sipx", "unenjoyment line bill gates says that only socialism can save us from climate change httpstcobfstykngx4"), dup_text = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, 
TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, 
FALSE, FALSE, FALSE, TRUE, FALSE), dup_clean_text = c(FALSE, 
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, 
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, 
TRUE, FALSE), dup_user = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, 
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, 
FALSE, TRUE, FALSE, FALSE, TRUE)), .Names = c("timestamp", "id_str", 
"user.id_str", "user.followers_count", "ideology", "text", "dup_text", 
"dup_clean_text", "dup_user"), row.names = c(NA, -30L), class = c("tbl_df", 
"tbl", "data.frame"))`
user72716

1 Answer

Here is one way to do it with your data and `dplyr`:

library(dplyr)

# data_text is the data frame from the question
data_text %>%
  group_by(text) %>%                  # group by tweet text
  summarise(freq = n(),               # count the occurrences
            date = min(timestamp))    # first time the tweet appears

# A tibble: 23 x 3
   text                                                                                                                       freq date               
   <chr>                                                                                                                     <int> <dttm>             
 1 annegalloway good  but maybe impossible to answer wout knowing the context ie war climate change other nonhumans that ma~     1 2015-10-30 21:47:44
 2 better dead than red bill gates says that only socialism can save us from climate change httpstcopypq                         1 2015-10-30 21:38:52
 3 better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok                   1 2015-10-30 21:38:10
 4 bill gates says that only socialism can save us from climate change httpstco8khrx6cgmd                                        1 2015-10-30 21:49:33
 5 bill gates says that only socialism can save us from climate change httpstcoe5psgltj59 httpstcop9oye6sipx                     2 2015-10-30 21:41:37
 6 bill gates says that only socialism can save us from climate change httpstcosspebdd3m9 httpstcovttupglukt                     1 2015-10-30 21:46:17
 7 bill gates says that only socialism can save us from climate change httpstcovjvklrncwq httpstcodtnjg7e0rz                     2 2015-10-30 21:45:30
 8 bring back huac bill gates says that only socialism can save us from climate change httpstcospnurbgevy                        1 2015-10-30 21:48:27
 9 communist core bill gates says that only socialism can save us from climate change httpstcogqm7k64f0r                         1 2015-10-30 21:38:27
10 dawn of the red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok                        1 2015-10-30 21:48:07
# ... with 13 more rows   
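
The summary above keeps only `text`, `freq`, and `date`. If the other columns of the first posting should survive as well, as the question asks, a minimal sketch could look like the following. The `tweets` data frame and its values are made up for illustration, and `collapsed` is just a hypothetical name; this is one possible approach, not necessarily the only or best one.

```r
library(dplyr)

# Toy data standing in for the question's data frame (values are made up).
tweets <- tibble(
  timestamp = as.POSIXct(c("2015-10-30 21:38:10", "2015-10-30 21:38:14",
                           "2015-10-30 21:39:18", "2015-10-30 21:40:02"),
                         tz = "UTC"),
  user.id_str = c(1, 2, 3, 1),
  text = c("the cat is in the tree", "hello world",
           "the cat is in the tree", "the cat is in the tree")
)

collapsed <- tweets %>%
  arrange(timestamp) %>%   # make sure the earliest posting comes first
  group_by(text) %>%
  mutate(freq = n()) %>%   # attach the group size to every row
  slice(1) %>%             # keep only the first row of each group
  ungroup() %>%
  arrange(timestamp)       # restore chronological order

collapsed
```

Each distinct text then appears once, with the timestamp and user of its first posting and a `freq` column holding the number of repeats.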

You might also consider removing the `https...` link tokens, so that tweets that differ only in their shortened URL are counted together:

data_text %>%
  mutate(cleaned_up = gsub("https\\w+ *", "", text)) %>%   # add a column with the "https..." tokens removed
  group_by(cleaned_up) %>%                                 # group by cleaned tweet text
  summarise(freq = n(),                                    # count the occurrences
            date = min(timestamp)) %>%                     # first time the tweet appears
  arrange(desc(freq)) %>%                                  # sort by descending frequency
  head()                                                   # keep the top 6

# A tibble: 6 x 3
  cleaned_up                                                                                                                  freq date               
  <chr>                                                                                                                      <int> <dttm>             
1 "bill gates says that only socialism can save us from climate change "                                                         6 2015-10-30 21:41:37
2 "this is an amusing headline bill gates says that only socialism can save us from climate change "                             4 2015-10-30 21:38:14
3 "better dead than red bill gates says that only socialism can save us from climate change "                                    2 2015-10-30 21:38:10
4 naomiaklein looks like the headline changed already bill gates says that capitalism cannot save us from climate chang          2 2015-10-30 21:42:25
5 what happens when you ingore climate good science amp management plans fail exhibit a atlantic cod science https               2 2015-10-30 21:48:59
6 annegalloway good  but maybe impossible to answer wout knowing the context ie war climate change other nonhumans that may~     1 2015-10-30 21:47:44
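
Two details are visible in the output above: the cleaned strings keep a trailing space, and a tweet truncated right after `https` (row 5) is left untouched, because `\\w+` needs at least one character after `https`. The sketch below shows this on a few made-up strings and tries a variation. The `https\\w*` pattern and the `trimws()` call are my assumptions, not part of the answer's code, and may or may not suit your data.

```r
# Made-up examples modelled on the tweets in the question.
x <- c("bill gates says httpstcoe5psgltj59 httpstcop9oye6sipx",
       "atlantic cod science https",
       "no link here")

# Pattern from the answer: requires at least one word character after "https",
# so the bare "https" in the second string is kept.
answer_clean <- gsub("https\\w+ *", "", x)

# Variation (an assumption): also drop a bare "https" and trim the
# whitespace left behind at either end.
cleaned <- trimws(gsub("https\\w* *", "", x))

answer_clean
cleaned
```

Grouping on `cleaned` instead of the raw text would treat a truncated link the same as a complete one.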
s__
  • You've got the right concept and I think it should work, but I get an error (`Error in summarise_impl(.data, dots) : Evaluation error: invalid 'type' (closure) of argument.`) when I try to apply it to my own data. I think I've fixed my reproducible data, if you could please take a look. It had URLs in the text, which weren't allowed, so I've just used the cleaned text column now. – user72716 Nov 30 '18 at 10:24
  • Could you add a sample of your data that reproduces the error? A few lines that can be copied and pasted into R are enough; just edit your question. Maybe your dates are not of class `Date`? You can check with `class(mydata$date_field)`, after first commenting out `date=min(time)` to see whether that line is the problem. – s__ Nov 30 '18 at 10:26
  • Ah, yeah my dates are apparently class `POSIXt`, which I've actually never seen before. Since my dataframe is organized by time (first observations in the dataframe are the first tweets posted), is it possible to just work around this and collapse/count the observations based on which observation occurs first in the dataframe? – user72716 Nov 30 '18 at 10:37
  • The point is that grouping does not take the date into account: it simply groups identical texts and counts the occurrences. If you think about it, within a group the first tweet is identical to the last, so a complex way of picking the first one may not be worth the effort. On the other hand, if you need the first date, you have to convert the date and apply the code above; `min` takes the smallest date, i.e. the first one. To convert, read [this](https://stackoverflow.com/questions/16557028/date-conversion-from-posixct-to-date-in-r) and ask if you have any problems. – s__ Nov 30 '18 at 10:41
  • Edited with your new data and it works; I also added a point that you may find useful. – s__ Nov 30 '18 at 11:29
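
A small sketch related to the date discussion in the comments, using made-up timestamps. As far as I can tell, `min()` works on `POSIXct` values directly, which matches the `<dttm>` column in the answer's output, so converting with `as.Date()` seems to be needed only if day resolution is wanted. The last step is one plausible reading of the error message, not a confirmed diagnosis: calling `min()` on the function `time` rather than on a data column produces the same "closure" wording.

```r
# Made-up timestamps, deliberately out of order.
ts <- as.POSIXct(c("2015-10-30 21:38:14", "2015-10-30 21:38:10"), tz = "UTC")

class(ts)                          # "POSIXct" "POSIXt"
first_seen <- min(ts)              # min() works on POSIXct without conversion
first_day  <- as.Date(first_seen)  # optional: drop the time of day

# Passing a function instead of a column: `time` here is stats::time,
# because no column or variable of that name exists.
msg <- tryCatch(min(time), error = function(e) conditionMessage(e))
msg
```

If the column in your data is called `timestamp`, then `min(timestamp)` inside `summarise()` refers to the column and avoids this.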