
I want to remove the rows in which the same word appears two or more times directly after each other in the sequence. This is preparation for a sequential pattern mining analysis.

I already tried the distinct() and duplicated() functions, but these only remove rows that are duplicated as a whole.

r_seq_5 <- r_seq_5[!duplicated(r_seq_5),] # remove duplicates


   #       Su Score result ROI       next_roi  third_roi  four_roi   five_roi   
   #  1     1    90 high   Elsewhere Elsewhere Teacher    Teacher    Teacher   
   #  2     1    90 high   Elsewhere Teacher   Teacher    Teacher    Teacher   
   #  3     1    90 high   Teacher   Pen       Teacher    Elsewhere  Smartboard

This is the table. It doesn't matter if Teacher appears two or three times in a row of the table, as long as the occurrences are not directly after each other.

The desired result is:

# 1     1    90 high   Teacher   Pen       Teacher    Elsewhere  Smartboard
berendpsv
  • Could you provide a reproducible example? https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – bbiasi May 13 '19 at 15:01
  • What do you need? I actually need some kind of for loop that checks whether the next word/value in the row is the same as the current one, and if so, deletes that row. I want only rows with different words next to each other. For example: AAABB is wrong, but ABABA is good. Or AABBC is wrong, but ABCAB is good. Hopefully this is a better explanation – berendpsv May 13 '19 at 15:04
  • Well, the reproducible example is right here in OP's question, but here's one if you want: – Gainz May 13 '19 at 15:11
  • `x1 <- c(90, 90, 90); x2 <- c("high", "high", "high"); x3 <- c("Elsewhere", "Elsewhere", "Teacher"); x4 <- c("Elsewhere", "Teacher", "Pen"); x5 <- c("Teacher", "Teacher", "Teacher"); x <- data.frame(x1, x2, x3, x4, x5)` – Gainz May 13 '19 at 15:11
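
The rule stated in the comments (AAABB is rejected, ABABA is kept) can be sketched in base R with rle(), which returns the lengths of runs of equal consecutive values. This is only a minimal sketch: the helper name has_adjacent_repeat and the vector roi_cols are hypothetical, and the sample data is retyped from the table in the question.

```r
# Hypothetical helper: TRUE when any value equals the value directly before it.
has_adjacent_repeat <- function(x) {
  # rle() gives the lengths of runs of identical consecutive values;
  # a run longer than 1 means a word is repeated back to back.
  any(rle(as.character(x))$lengths > 1)
}

# Sample data retyped from the table in the question.
r_seq_5 <- data.frame(
  Su = 1, Score = 90, result = "high",
  ROI       = c("Elsewhere", "Elsewhere", "Teacher"),
  next_roi  = c("Elsewhere", "Teacher", "Pen"),
  third_roi = c("Teacher", "Teacher", "Teacher"),
  four_roi  = c("Teacher", "Teacher", "Elsewhere"),
  five_roi  = c("Teacher", "Teacher", "Smartboard"),
  stringsAsFactors = FALSE
)

# Assumed names of the columns that hold the sequence.
roi_cols <- c("ROI", "next_roi", "third_roi", "four_roi", "five_roi")

# Check every row and keep only those without a back-to-back repeat.
drop <- apply(r_seq_5[roi_cols], 1, has_adjacent_repeat)
kept <- r_seq_5[!drop, ]
kept
```

With this data only the third row (Teacher, Pen, Teacher, Elsewhere, Smartboard) should remain, because Teacher appears twice there but never twice in a row.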

2 Answers


You can use gather() to reshape the ROI columns into long format, and then loop over the result to flag every value that is the same as the preceding one within the same original row.

Finally, use spread() to rebuild your initial structure.

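library(dplyr)
library(tidyr)

# Example data: one sequence per row, stored in the ROI columns
df <- data.frame(
  row = 1:4,
  Su = 1,
  Score = 90,
  result = 'high',
  ROI  = c('A', 'A', 'B', 'A'),
  ROI2 = c('A', 'B', 'C', 'B'),
  ROI3 = c('B', 'B', 'A', 'C')
) %>% 
  # Long format: one line per (row, ROI column) pair
  gather(-(row:result), key = roi, value = value) %>% 
  arrange(row) %>% 
  mutate(repeated = 0)

# Flag a value that equals the preceding value of the same original row
for(i in 2:nrow(df)){
  if(df$row[i] == df$row[i-1] && df$value[i] == df$value[i-1])
    df$repeated[i] = 1
}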

df %>% 
  group_by(row) %>% 
  mutate(repeated = sum(repeated)) %>% 
  filter(repeated == 0) %>% 
  select(-repeated) %>% 
  spread(key = roi, value = value)

#     row    Su Score result ROI   ROI2  ROI3 
#   <int> <dbl> <dbl> <fct>  <chr> <chr> <chr>
# 1     3     1    90 high   B     C     A    
# 2     4     1    90 high   A     B     C 
demarsylvain
  • Thanks for your answer! However, I do not want to count them; I want to delete the rows that contain a repeated value and keep only the ones with different words next to each other. – berendpsv May 13 '19 at 15:31
  • I edited the answer. Once the repetitions are identified, you can remove the rows with at least one repetition, and then use `spread()` to recreate your initial rows. – demarsylvain May 13 '19 at 15:53
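
The same reshape, flag, and filter idea can be sketched without the explicit loop. This is an alternative sketch, not the answerer's code: it assumes tidyr 1.0.0 or later, where pivot_longer() and pivot_wider() supersede gather() and spread(), and it uses dplyr's lag() to compare each value with the preceding one inside each row group. The name no_repeats is hypothetical.

```r
library(dplyr)
library(tidyr)

# Same example data as in the answer above.
df <- data.frame(
  row = 1:4, Su = 1, Score = 90, result = "high",
  ROI  = c("A", "A", "B", "A"),
  ROI2 = c("A", "B", "C", "B"),
  ROI3 = c("B", "B", "A", "C"),
  stringsAsFactors = FALSE
)

no_repeats <- df %>%
  # Long format: one line per (row, ROI column) pair, in column order
  pivot_longer(cols = ROI:ROI3, names_to = "roi", values_to = "value") %>%
  group_by(row) %>%
  # lag() shifts the values by one position within the group; the first
  # comparison is against NA and is ignored through na.rm = TRUE.
  filter(!any(value == lag(value), na.rm = TRUE)) %>%
  ungroup() %>%
  # Back to the original wide layout
  pivot_wider(names_from = roi, values_from = value)

no_repeats
```

Because the filter condition is evaluated once per group, a whole original row is either kept or dropped, so no helper column needs to be created and removed again.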

To do this, I found it convenient to turn the factors into numbers. That was my first step, because comparing adjacent columns for matches seems less arduous this way.

For this I used a for loop and the qdap package: every value that matches a given ROI label is first replaced with NA, and qdap::NAer() then replaces those NAs with a number.

library(dplyr)
library(qdap)
df <- data.frame(Su = rep(1,3),
                 Score = rep(90,3),
                 ROI = c("A", "A", "B"),
                 NETX_ROI = c("A", "B", "C"),
                 third_roi = rep("B", 3),
                 four_roi = c("B", "B", "A"),
                 five_roi = c("B", "B", "D"))
df
> df
  Su Score ROI NETX_ROI third_roi four_roi five_roi
1  1    90   A        A         B        B        B
2  1    90   A        B         B        B        B
3  1    90   B        C         B        A        D
df2 <- df
roi <- c("A", "B", "C", "D")
# A = Elsewhere
# B = Teacher
# C = Pen
# D = Smartboard

n <- seq(1, length.out = length(roi))
for (i in 1:length(n)) {
  df2[df2 == roi[i]] <- NA
  df2 <- qdap::NAer(df2, i)
}
> df2
  Su Score ROI NETX_ROI third_roi four_roi five_roi
1  1    90   1        1         2        2        2
2  1    90   1        2         2        2        2
3  1    90   2        3         2        1        4
df2 <- df2 %>% 
  dplyr::select(-c(Su, Score)) %>% 
  as.matrix()

nn <- ncol(df2)
x  <- matrix(nrow = nrow(df2), ncol = ncol(df2)-1)
for (i in 1:(nn-1)) {
  xx <- ifelse(df2[,i] == df2[,i+1], NA, 0)
  x[,i] <- as.matrix(xx)
}
> x
     [,1] [,2] [,3] [,4]
[1,]   NA    0   NA   NA
[2,]    0   NA   NA   NA
[3,]    0    0    0    0

Finally, I just removed the rows containing NA.

dfx <- x %>% 
  as.data.frame()

df_test <- df %>% 
  dplyr::bind_cols(dfx) %>% 
  na.omit() %>% 
  dplyr::select(1:ncol(df))
df_test
> df_test
  Su Score ROI NETX_ROI third_roi four_roi five_roi
3  1    90   B        C         B        A        D
bbiasi
  • Could you upvote the answer? https://meta.stackexchange.com/questions/173399/how-to-upvote-on-stack-overflow – bbiasi May 20 '19 at 15:37
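
The adjacent-column comparison in the answer above can also be sketched without the numeric recoding and without qdap, since `==` compares character values directly. This is only a sketch under the assumption that the sequence columns are the five listed in the hypothetical vector roi_cols; df_clean and same_as_previous are likewise illustrative names.

```r
# Same example data as in the answer above.
df <- data.frame(Su = rep(1, 3),
                 Score = rep(90, 3),
                 ROI = c("A", "A", "B"),
                 NETX_ROI = c("A", "B", "C"),
                 third_roi = rep("B", 3),
                 four_roi = c("B", "B", "A"),
                 five_roi = c("B", "B", "D"),
                 stringsAsFactors = FALSE)

# Assumed names of the columns that hold the sequence.
roi_cols <- c("ROI", "NETX_ROI", "third_roi", "four_roi", "five_roi")
m <- as.matrix(df[roi_cols])

# Compare columns 2..n with columns 1..(n-1) in one vectorized step:
# TRUE marks a value that equals its left-hand neighbour.
same_as_previous <- m[, -1, drop = FALSE] == m[, -ncol(m), drop = FALSE]

# Keep the rows that have no TRUE at all.
df_clean <- df[rowSums(same_as_previous) == 0, ]
df_clean
```

The logical matrix same_as_previous plays the role of x in the answer, with TRUE where the answer has NA.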