Finding repeated sentences/words/phrases by group over time

Question

I have a dataset in which each column is a variable and each row is an observation (like time series data). It looks like this (I apologize for the format, but I can't show the data):

I'd like to know if a person or group is saying the same thing(s) over time. I'm familiar with n-grams, but it's not quite what I need. Any help would be appreciated.

This is the output I'd like:

Sorry for all the edits and poor comments; still getting used to the website.

Do you want to know whether the value is unique, or how it changes over time? – Nico Coallier, Jun 15 '17 at 14:03
Please provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and the expected output. This will make it easier for others to help you. – Sotos, Jun 15 '17 at 14:04
It almost sounds like you would want this in a database table with group bys, or in a data warehouse. – sniperd, Jun 15 '17 at 14:08
@Nico I'd like to know the frequency of repeated comments as a percentage of total comments, with a breakdown by group and/or reporting person. Let's say there's another variable called "ready" with values yes, no, maybe. Is the change in "ready" correlated with a change in comments (that is, when Ready goes from No->Yes, did the comments change from the previous report)? – Alex, Jun 15 '17 at 15:42

Nico Coallier · Accepted Answer · 2017-06-15T18:08:24.960

If you want to see the frequency of each comment for each person, broken down by a new column Ready, you can use the following code:

library(dplyr)   # loaded first: %>% is used to build the data below

set.seed(123456)

### I use the same data as the previous example, thank you for providing this!
data <- data.frame(date = Sys.Date() - sample(100),
                   Group = c("Cars","Trucks") %>% sample(100, replace=TRUE),
                   Reporting_person = c("A","B","C") %>% sample(100, replace=TRUE),
                   Comments = c("Awesome","Meh","NC") %>% sample(100, replace=TRUE),
                   Ready = as.character(c("Yes","No") %>% sample(100, replace=TRUE))
                   )

### Count each comment per Reporting_person and Ready, then convert the counts
### to proportions within each Reporting_person / Ready combination
data %>%
    group_by(Reporting_person,Ready) %>%
    count(Comments) %>%
    mutate(prop = prop.table(n))

If instead you want to see whether the comments change over time, and whether that change is associated with an event (like Ready), you can use the following code:

library(dplyr)

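### Create a column holding the previous comment (lag) within each Group / Reporting_person
new = data %>% 
        arrange(Reporting_person,Group,date) %>%
        group_by(Group,Reporting_person) %>%
        mutate(comments_plusone=lag(Comments))

### Drop the first report of each Group / Reporting_person, which has no previous comment
new = na.omit(new)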

### Creating the change column   1 is a change , 0 no change

new$Change = as.numeric(new$Comments != new$comments_plusone)

### Test the association between Change and the event (Ready)

### Chi-squared test of independence between the event and the change
### Note that using Pearson correlation is not pertinent here:

tbl <- table(new$Ready,new$Change)

chi2 = chisq.test(tbl, correct=FALSE)
c(chi2$statistic, chi2$p.value)
### Phi coefficient (effect size for a 2x2 table)
sqrt(chi2$statistic / sum(tbl))

You should find no significant association in this example, as you can clearly see when you plot the table.

plot(tbl)

Note that using the cor function is not appropriate when working with two binary variables.

Here is a post on this topic: Correlation between two binary
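
For readers wondering what the last line of the code above computes: sqrt(chi2$statistic / sum(tbl)) is the phi coefficient, a common effect size for a 2x2 table. Below is a minimal sketch on a made-up table; the counts and the names toy, res and phi are hypothetical and do not come from the question's data.

```r
# Hypothetical 2x2 table: rows = event (Ready), columns = comment changed (0/1).
toy <- as.table(matrix(c(30, 10, 10, 30), nrow = 2,
                       dimnames = list(Ready = c("No", "Yes"),
                                       Change = c("0", "1"))))

# Chi-squared test of independence without continuity correction.
res <- chisq.test(toy, correct = FALSE)

# Phi coefficient: square root of the statistic divided by the number of observations.
phi <- sqrt(unname(res$statistic) / sum(toy))
phi
```

In this toy table the statistic is 20 on 80 observations, so phi is 0.5. Values close to 0, as expected for the random example data above, indicate little or no association.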

Frequency of change by change of state

Following your comments, I am adding this code:

newR = data %>% 
        arrange(Reporting_person,Group,date) %>%
        group_by(Group,Reporting_person) %>%
        mutate(Ready_plusone=lag(Ready)) 


newR = na.omit(newR)

###------------------------Add the column to the new data frame
### new and newR keep the same rows in the same order (both drop the first report
### of each Group / Reporting_person), so the columns line up.
### Creating the change of state; I use paste because you seem to have more than 2 levels.
### Labels read current_previous, e.g. "Yes_No" means Ready is now Yes and was No.
new$State_change = paste(newR$Ready,newR$Ready_plusone,sep="_")

### Getting the frequency of Change by change of state (Yes_No, No_Yes, ...)
result <- new %>% 
                group_by(Reporting_person,State_change) %>%
                count(Change) %>%
                mutate(Frequence = prop.table(n))%>%
                filter(Change==1)

### tidyr is a great library for reshaping data; here you want the wide format of the previous long
### data frame. However, doing this will generate a lot of NAs, so if I were you I would keep
### the result format instead of the following, but this could be helpful for future needs, so here you go.

library(tidyr)

### The proportions are stored in the Frequence column created above
final = as.data.frame(spread(result, key = State_change, value = Frequence))[,c(1,4:7)]
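
As a variation on the code above, here is a minimal sketch of the table requested in the comments: one row per Group, one column per Ready transition, and values giving the percentage of reports in which the comment changed. It assumes tidyr 1.0.0 or later for pivot_wider; the toy data and the names reports, prev_ready, prev_comment, transition, changed and pct_changed are hypothetical. Unlike the labels above, the transition labels here read previous->current.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one report per date, five reports per group.
reports <- data.frame(
  date     = as.Date("2017-01-01") + 0:9,
  Group    = rep(c("Cars", "Trucks"), each = 5),
  Ready    = c("No", "No", "Yes", "Yes", "No",  "No", "Yes", "No", "Yes", "Yes"),
  Comments = c("a", "a", "b", "b", "b",  "x", "y", "y", "z", "z"),
  stringsAsFactors = FALSE
)

transitions <- reports %>%
  arrange(Group, date) %>%
  group_by(Group) %>%
  # Previous Ready value and previous comment within the same group.
  mutate(prev_ready   = dplyr::lag(Ready),
         prev_comment = dplyr::lag(Comments)) %>%
  # The first report of each group has nothing to compare against.
  filter(!is.na(prev_ready)) %>%
  mutate(transition = paste(prev_ready, Ready, sep = "->"),
         changed    = Comments != prev_comment) %>%
  group_by(Group, transition) %>%
  summarise(pct_changed = 100 * mean(changed)) %>%
  ungroup() %>%
  # One column per transition; combinations a group never shows become NA.
  pivot_wider(names_from = transition, values_from = pct_changed)

transitions
```

Replacing Group with Reporting_person in both group_by calls would give the per-person version of the same table.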

Hope this helps :)

This is great, thanks! Not necessarily looking for a correlation (though now I will), just poor specification on my end. Is there a way to determine how often the comments changed when "ready" does? So, % of time comments change when it goes from yes->no, no->yes (all combinations), broken down by group. The table's columns would be: No->Yes, Yes->No, No->Maybe, etc. The rows would be reporting person or group: Trucks, Cars. The values would be % of change = frequency of "did change" / total comments in that section. – Alex, Jun 15 '17 at 17:29
Of course... I will add it after lunch, but you could merge the two pieces of code :) – Nico Coallier, Jun 15 '17 at 17:31
Thanks so much. I've only been in the community for a few days and everyone has been so helpful. Enjoy lunch. – Alex, Jun 15 '17 at 17:34

score 0 · Answer 2 · answered Jun 15 '17 at 14:17

Something like this?

library(dplyr)   # provides the %>% pipe used below

df <- data.frame(date = Sys.Date() - sample(10),
                 Group = c("Cars","Trucks") %>% sample(10, replace=TRUE),
                 Reporting_person = c("A","B","C") %>% sample(10, replace=TRUE),
                 Comments = c("Awesome","Meh","NC") %>% sample(10, replace=TRUE))

#          date  Group Reporting_person Comments
# 1  2017-06-08 Trucks                B  Awesome
# 2  2017-06-05 Trucks                A  Awesome
# 3  2017-06-14   Cars                B      Meh
# 4  2017-06-06   Cars                B  Awesome
# 5  2017-06-11   Cars                A      Meh
# 6  2017-06-07   Cars                B       NC
# 7  2017-06-09   Cars                A       NC
# 8  2017-06-10   Cars                A       NC
# 9  2017-06-13 Trucks                C  Awesome
# 10 2017-06-12 Trucks                B       NC

# Count the rows for each Group / Reporting_person / Comments combination
aggregate(date ~ .,df,length)

#    Group Reporting_person Comments date
# 1 Trucks                A  Awesome    1
# 2   Cars                B  Awesome    1
# 3 Trucks                B  Awesome    1
# 4 Trucks                C  Awesome    1
# 5   Cars                A      Meh    1
# 6   Cars                B      Meh    1
# 7   Cars                A       NC    2
# 8   Cars                B       NC    1
# 9 Trucks                B       NC    1

moodymudskipper

You can see my above comment to Nico. So imagine someone keeps copying and pasting the same comments over time, because nothing changed. Then an event occurs, the variable changes, and new comments are added. I'd like to know: 1) how often copy-pasting occurs, by group and/or person, 2) does the comment change when the other variable changes? 3) what percentage of all comments are actually "original" (not copy-pasted from the previous report). – Alex Jun 15 '17 at 15:48
I think that reply answers 1) by group AND person, and 2). If it doesn't you may want to give the expected output in your question. If it does I can easily complete it for 1) group OR person and 3) – moodymudskipper Jun 15 '17 at 15:58
So the problem is I don't know the specific comments, and there are tens of thousands of observations. I'd like something like a change detection code that counts how often the comments are the same as the previous one. Am I making any sense? – Alex Jun 15 '17 at 16:54
I still don't understand. Even if you don't know all the specifics, I don't see why you can't give us an example of the output you'd like. You could give a table with fake values, for example. – moodymudskipper Jun 15 '17 at 17:02
Definition of repeated comment: was the comment the same as the last observation from that person or group?

Table 1
Group     # of repeated comments    % repeated comments as a total of group
Cars      100                       25%
Trucks    120                       40%
SUVs      56                        11%

Table 2
Reporting person    Number of repeated comments    % repeated comments as a total of person
A                   10                             10%
B                   15                             31%
C                   3300                           4%
D                   103                            80%

– Alex Jun 15 '17 at 17:15
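
Under the definition in the last comment (a comment is repeated when it equals the previous comment from the same group), the two tables can be sketched as follows. This is only an illustration on hypothetical toy data; the names reports, repeated, n_repeated and pct_repeated are assumptions, and it never needs to know the specific comment texts in advance.

```r
library(dplyr)

# Hypothetical toy data: four dated reports per group.
reports <- data.frame(
  date     = as.Date("2017-01-01") + 0:7,
  Group    = rep(c("Cars", "Trucks"), each = 4),
  Comments = c("ok", "ok", "ok", "new part",  "late", "fixed", "fixed", "done"),
  stringsAsFactors = FALSE
)

repeat_rate <- reports %>%
  arrange(Group, date) %>%
  group_by(Group) %>%
  # Repeated = same text as the previous report in the same group.
  # The first report of a group has no predecessor and counts as not repeated.
  mutate(repeated = !is.na(dplyr::lag(Comments)) &
                    Comments == dplyr::lag(Comments)) %>%
  summarise(n_repeated   = sum(repeated),
            pct_repeated = 100 * mean(repeated))

repeat_rate
```

Grouping by Reporting_person instead of Group would give the second table, and 100 minus pct_repeated is the share of "original" comments.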
