1

I have started learning R and I am currently familiarising myself with a the text mining package.

I have started a project where I have exported a chat log from a group chat and have cleansed the data in Excel into 4 rows.

I have been able to collapse all messages into one and create a wordcloud of the most used words.

EDIT: Apologies for the vague submission

The first step I have done was cleaned the data to follow a simple structure as below

Date            Time        Sender      Message
01/01/2019      09:54:03    Person 1    Hello
01/01/2019      10:55:03    Person 2    Hello
01/01/2019      11:56:03    Person 3    Hello
01/01/2019      12:57:03    Person 4    Hello

using the tm and wordcloud packages, I have successfully put together a wordcloud of the most common words over the past year across all members of the chat, through the following method.

    library(tm)
    library(readr)

    Chat <- read.csv("ChatExport_Cleansed.csv", stringsAsFactors = FALSE)

    chat_text <- paste(Chat$Message, collapse=" ")

    chat_source <- VectorSource(chat_text)

    corpus <- Corpus(chat_source)

    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    dtm <- DocumentTermMatrix(corpus)
    dtm2 <- as.matrix(dtm)

    frequency <- colSums(dtm2)
    frequency <- sort(frequency, decreasing=TRUE)

words <- names(frequency)
color_pal <- viridis(n =20)
wordcloud(words[1:300], frequency[1:900], 
          random.order=TRUE, random.color=TRUE,colors = color_pal)

The next things that I wish to explore are the most frequent words per sender, as well as the sender that has sent the most messages during certain hours (9-5).

The output would look something like

Sender 1: Most frequent word Sender 2: Most frequent word etc

I would also like to see an output of the number of times each sender sent a message between 9am - 5pm.

I am not sure how to achieve this, is it possible to use the collapse function to breakdown messages into one large vector per sender?

Thank you for any advice in advance!

TomCicco
  • 13
  • 3
  • 1
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Have you tried anything yet? Where exactly are you getting stuck? This question seems a bit too unfocused at the moment. – MrFlick Jan 15 '20 at 19:51
  • Apologies for the vague entry, I have updated this! – TomCicco Jan 15 '20 at 20:15
  • @TomCicco, you should read the link MrFlick has attached to his/her comments, it will explain how to provide a reproducible example that we can all use in our R session with a simple copy/paste. – dc37 Jan 15 '20 at 21:16

1 Answers1

0

Starting with the following dataframe:

head(df)
# A tibble: 6 x 4
  Date       Time     Sender  Message
  <date>     <chr>    <chr>   <fct>  
1 2020-01-01 00:00:00 Person1 C      
2 2020-01-01 01:00:00 Person1 C      
3 2020-01-01 02:00:00 Person1 B      
4 2020-01-01 03:00:00 Person1 B      
5 2020-01-01 04:00:00 Person1 C      
6 2020-01-01 05:00:00 Person1 E   

You can first filter for specific hours by setting a Date_Time column using lubridate package and the function ymd_hms and use the filter function from dplyr to get only message sent between 9 AM and 5 PM.

library(lubridate)
library(dplyr)
df %>% mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
  filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17)

# A tibble: 18 x 5
   Date       Time     Sender  Message Date_Time          
   <date>     <chr>    <chr>   <fct>   <dttm>             
 1 2020-01-01 09:00:00 Person1 C       2020-01-01 09:00:00
 2 2020-01-01 10:00:00 Person1 E       2020-01-01 10:00:00
 3 2020-01-01 11:00:00 Person1 C       2020-01-01 11:00:00
 4 2020-01-01 12:00:00 Person1 C       2020-01-01 12:00:00
 5 2020-01-01 13:00:00 Person1 A       2020-01-01 13:00:00
 6 2020-01-01 14:00:00 Person1 D       2020-01-01 14:00:00
 7 2020-01-01 15:00:00 Person1 A       2020-01-01 15:00:00
 8 2020-01-02 16:00:00 Person1 A       2020-01-02 16:00:00
 9 2020-01-02 17:00:00 Person1 E       2020-01-02 17:00:00
10 2020-01-01 09:00:00 Person2 D       2020-01-01 09:00:00
11 2020-01-01 10:00:00 Person2 E       2020-01-01 10:00:00
12 2020-01-01 11:00:00 Person2 E       2020-01-01 11:00:00
13 2020-01-01 12:00:00 Person2 C       2020-01-01 12:00:00
14 2020-01-01 13:00:00 Person2 A       2020-01-01 13:00:00
15 2020-01-01 14:00:00 Person2 B       2020-01-01 14:00:00
16 2020-01-01 15:00:00 Person2 E       2020-01-01 15:00:00
17 2020-01-02 16:00:00 Person2 E       2020-01-02 16:00:00
18 2020-01-02 17:00:00 Person2 D       2020-01-02 17:00:00

Then, you can group_by each sender and message to calculate the frequency of each message and then filter for the maximal frequency for each sender.

df %>% mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
  filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17) %>%
  group_by(Sender, Message) %>% count() %>% 
  group_by(Sender) %>%
  filter(n == max(n))

# A tibble: 3 x 3
# Groups:   Sender [2]
  Sender  Message     n
  <chr>   <fct>   <int>
1 Person1 A           3
2 Person1 C           3
3 Person2 E           4

If you want to know the number of messages sent by each sender in a certain period of time, you can do:

df %>% mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
  filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17) %>%
  group_by(Sender) %>% count()

# A tibble: 2 x 2
# Groups:   Sender [2]
  Sender      n
  <chr>   <int>
1 Person1     9
2 Person2     9

Does it answer your question ?

Data

structure(list(Date = structure(c(18262, 18262, 18262, 18262, 
18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262, 
18262, 18262, 18262, 18263, 18263, 18263, 18263, 18263, 18263, 
18263, 18263, 18263, 18262, 18262, 18262, 18262, 18262, 18262, 
18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262, 
18262, 18263, 18263, 18263, 18263, 18263, 18263, 18263, 18263, 
18263), class = "Date"), Time = c("00:00:00", "01:00:00", "02:00:00", 
"03:00:00", "04:00:00", "05:00:00", "06:00:00", "07:00:00", "08:00:00", 
"09:00:00", "10:00:00", "11:00:00", "12:00:00", "13:00:00", "14:00:00", 
"15:00:00", "16:00:00", "17:00:00", "18:00:00", "19:00:00", "20:00:00", 
"21:00:00", "22:00:00", "23:00:00", "00:00:00", "00:00:00", "01:00:00", 
"02:00:00", "03:00:00", "04:00:00", "05:00:00", "06:00:00", "07:00:00", 
"08:00:00", "09:00:00", "10:00:00", "11:00:00", "12:00:00", "13:00:00", 
"14:00:00", "15:00:00", "16:00:00", "17:00:00", "18:00:00", "19:00:00", 
"20:00:00", "21:00:00", "22:00:00", "23:00:00", "00:00:00"), 
    Sender = c("Person1", "Person1", "Person1", "Person1", "Person1", 
    "Person1", "Person1", "Person1", "Person1", "Person1", "Person1", 
    "Person1", "Person1", "Person1", "Person1", "Person1", "Person1", 
    "Person1", "Person1", "Person1", "Person1", "Person1", "Person1", 
    "Person1", "Person1", "Person2", "Person2", "Person2", "Person2", 
    "Person2", "Person2", "Person2", "Person2", "Person2", "Person2", 
    "Person2", "Person2", "Person2", "Person2", "Person2", "Person2", 
    "Person2", "Person2", "Person2", "Person2", "Person2", "Person2", 
    "Person2", "Person2", "Person2"), Message = structure(c(3L, 
    3L, 2L, 2L, 3L, 5L, 4L, 1L, 2L, 3L, 5L, 3L, 3L, 1L, 4L, 1L, 
    1L, 5L, 3L, 2L, 2L, 1L, 3L, 4L, 1L, 3L, 5L, 4L, 2L, 5L, 1L, 
    1L, 2L, 3L, 4L, 5L, 5L, 3L, 1L, 2L, 5L, 5L, 4L, 5L, 2L, 1L, 
    1L, 3L, 1L, 5L), .Label = c("A", "B", "C", "D", "E"), class = "factor")), row.names = c(NA, 
-50L), class = c("tbl_df", "tbl", "data.frame"))
dc37
  • 15,840
  • 4
  • 15
  • 32