String comparison within a row: remove rows whose elements are all the same

Question

I have a data set like this:

x<-c('ASON10_SHROFF-1/3/16/1/02-Au4P','ASON10_SHROFF-1/3/16/1/06-Au4P','ASON10_SHROFF-1/3/16/1/09-Au4P', 'ASON10_SHROFF-1/3/16/1/09-Au4P', 'ASON11_TALWAR-1/3/12/2/04-Au4P', 'ASON11_TALWAR-1/3/12/2/04-Au4P')

y <- c('SERVER_SIGNAL_FAILURE-TMe, UNAVAILABLE_TIME-TMe-PMNE1d, UNEQUIPPED-TMe', 'SERVER_SIGNAL_FAILURE-TMe, UNAVAILABLE_TIME-TMe-PMNE1d, UNEQUIPPED-TMe', 'SERVER_SIGNAL_FAILURE-TMe, REMOTE_DEFECT_INDICATION-TMi, UNAVAILABLE_TIME-TMe-PMNE1d, UNAVAILABLE_TIME-TMi-PMFE1d, UNEQUIPPED-TMe', 'SERVER_SIGNAL_FAILURE-TMe, REMOTE_DEFECT_INDICATION-TMi, UNAVAILABLE_TIME-TMi-PMFE1d, UNAVAILABLE_TIME-TMe-PMNE1d','DEGRADED_SIGNAL-TMe, SERVER_SIGNAL_FAILURE-TMe, UNEQUIPPED-TMe','UNEQUIPPED-TMe, UNEQUIPPED-TMe,UNEQUIPPED-TMe')

df <- data.frame(x, y)
df <- data.frame(lapply(df, as.character), stringsAsFactors = FALSE)

I want to remove the rows whose y column contains only identical elements (the elements are separated by commas) and then compute the ratio cy/cx. I have tried the code below, but the row with identical entries still remains in the result, and the output is also concatenated by x.

library(dplyr)

z<-df %>%
  mutate(row = row_number(), 
         y1 = y) %>%
  add_count(x, name = 'cx') %>%
  tidyr::separate_rows(y1, sep = ",") %>%
  group_by(row) %>%
  summarise(across(c(x, cx, y), first), 
            cy = n(), 
            rat = cy/cx, 
            n = n_distinct(y1)) %>%
  filter(n > 1) %>%
  select(-row, -n)

The desired output is:

x<-c('ASON10_SHROFF-1/3/16/1/02-Au4P','ASON10_SHROFF-1/3/16/1/06-Au4P','ASON10_SHROFF-1/3/16/1/09-Au4P', 'ASON10_SHROFF-1/3/16/1/09-Au4P', 'ASON11_TALWAR-1/3/12/2/04-Au4P')
cx <-c(1,1,2,2,1)
y <- c('SERVER_SIGNAL_FAILURE-TMe, UNAVAILABLE_TIME-TMe-PMNE1d, UNEQUIPPED-TMe', 'SERVER_SIGNAL_FAILURE-TMe, UNAVAILABLE_TIME-TMe-PMNE1d, UNEQUIPPED-TMe', 'SERVER_SIGNAL_FAILURE-TMe, REMOTE_DEFECT_INDICATION-TMi, UNAVAILABLE_TIME-TMe-PMNE1d, UNAVAILABLE_TIME-TMi-PMFE1d, UNEQUIPPED-TMe', 'SERVER_SIGNAL_FAILURE-TMe, REMOTE_DEFECT_INDICATION-TMi, UNAVAILABLE_TIME-TMi-PMFE1d, UNAVAILABLE_TIME-TMe-PMNE1d','DEGRADED_SIGNAL-TMe, SERVER_SIGNAL_FAILURE-TMe, UNEQUIPPED-TMe')
cy <-c(3,3,5,4,3)
rat <-c(3/1,3/1,5/2,5/2,3,1)

First, [please don't ask the same question](https://meta.stackexchange.com/q/64068/422674) [on different sites](https://datascience.stackexchange.com/q/86731/64377). Second, please try to make your question as clear as possible: "same elements" -> same as what? ; "count the ratio" -> ratio of what? ; x and y are different at the beginning and at the end ; and your final desired outcome seems to contain mistakes in the last two values of `rat` (as far as I understand). — Erwan, Dec 15 '20 at 23:46

score 1 · Accepted Answer · answered Dec 15 '20 at 23:48

Since I took the time to decipher the question, let me propose an answer:

library(plyr)
library(stringr)

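count_items_between_commas <- function(char_vector) {
  # For each string: strip spaces, split on commas, count the pieces
  l <- llply(char_vector, function(s) {
    r <- strsplit(str_replace_all(s, ' ', ''), split = ',', fixed = TRUE)
    length(r[[1]])
  })
  unlist(l)  # flatten the list of counts into a vector
}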

x<-c('ASON10_SHROFF-1/3/16/1/02-Au4P','ASON10_SHROFF-1/3/16/1/06-Au4P','ASON10_SHROFF-1/3/16/1/09-Au4P', 'ASON10_SHROFF-1/3/16/1/09-Au4P', 'ASON11_TALWAR-1/3/12/2/04-Au4P')
y <- c('SERVER_SIGNAL_FAILURE-TMe, UNAVAILABLE_TIME-TMe-PMNE1d, UNEQUIPPED-TMe', 'SERVER_SIGNAL_FAILURE-TMe, UNAVAILABLE_TIME-TMe-PMNE1d, UNEQUIPPED-TMe', 'SERVER_SIGNAL_FAILURE-TMe, REMOTE_DEFECT_INDICATION-TMi, UNAVAILABLE_TIME-TMe-PMNE1d, UNAVAILABLE_TIME-TMi-PMFE1d, UNEQUIPPED-TMe', 'SERVER_SIGNAL_FAILURE-TMe, REMOTE_DEFECT_INDICATION-TMi, UNAVAILABLE_TIME-TMi-PMFE1d, UNAVAILABLE_TIME-TMe-PMNE1d','DEGRADED_SIGNAL-TMe, SERVER_SIGNAL_FAILURE-TMe, UNEQUIPPED-TMe')
df <-data.frame(x,y,stringsAsFactors = FALSE)

df$cx <- as.numeric(ave(df$x, df$x, FUN=length))   # from https://stackoverflow.com/a/36525074/891919
df$cy <- count_items_between_commas(df$y)
df$rat <- df$cy / df$cx

As far as I understand, this is what you want, right?
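The code above starts from five rows, so the row whose items are all identical is already gone. A minimal base R sketch of that removal step, starting from the six-row data in the question, might look like the following. The names `items`, `keep` and `out` are made up for illustration, and it assumes "same elements" means that every comma-separated item in `y` is identical after trimming whitespace.

```r
x <- c('ASON10_SHROFF-1/3/16/1/02-Au4P', 'ASON10_SHROFF-1/3/16/1/06-Au4P',
       'ASON10_SHROFF-1/3/16/1/09-Au4P', 'ASON10_SHROFF-1/3/16/1/09-Au4P',
       'ASON11_TALWAR-1/3/12/2/04-Au4P', 'ASON11_TALWAR-1/3/12/2/04-Au4P')
y <- c('SERVER_SIGNAL_FAILURE-TMe, UNAVAILABLE_TIME-TMe-PMNE1d, UNEQUIPPED-TMe',
       'SERVER_SIGNAL_FAILURE-TMe, UNAVAILABLE_TIME-TMe-PMNE1d, UNEQUIPPED-TMe',
       'SERVER_SIGNAL_FAILURE-TMe, REMOTE_DEFECT_INDICATION-TMi, UNAVAILABLE_TIME-TMe-PMNE1d, UNAVAILABLE_TIME-TMi-PMFE1d, UNEQUIPPED-TMe',
       'SERVER_SIGNAL_FAILURE-TMe, REMOTE_DEFECT_INDICATION-TMi, UNAVAILABLE_TIME-TMi-PMFE1d, UNAVAILABLE_TIME-TMe-PMNE1d',
       'DEGRADED_SIGNAL-TMe, SERVER_SIGNAL_FAILURE-TMe, UNEQUIPPED-TMe',
       'UNEQUIPPED-TMe, UNEQUIPPED-TMe,UNEQUIPPED-TMe')
df <- data.frame(x, y, stringsAsFactors = FALSE)

# Split each y string on commas and trim whitespace around every item
items <- lapply(strsplit(df$y, ",", fixed = TRUE), trimws)

# Assumption: a row is dropped when all of its items are identical
keep <- vapply(items, function(v) length(unique(v)) > 1, logical(1))
out <- df[keep, , drop = FALSE]

# Count occurrences of each x among the remaining rows
out$cx <- ave(seq_along(out$x), out$x, FUN = length)
# Count all comma-separated items per row (duplicates included)
out$cy <- lengths(items[keep])
out$rat <- out$cy / out$cx
```

Filtering before counting matters here: `cx` for the `ASON11_TALWAR` row is computed after the duplicate-only row has been dropped, which matches the `cx` column in the desired output.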

score 0 · Answer 2 · answered Dec 09 '20 at 14:31

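I think the problem lies in your attempt to separate. You would like to separate by commas only, but tidyr::separate_rows() by default splits on any run of non-alphanumeric characters (its default is sep = "[^[:alnum:].]+"), so it also splits on underscores, hyphens and spaces.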

z<- df %>% mutate(row = row_number(), y1 = y)
z<- z %>% add_count(x, name = 'cx')
z<- z %>% tidyr::separate_rows(y1)

Specifying the delimiter explicitly (sep = ",") should fix your output.

z<- df %>% mutate(row = row_number(), y1 = y)
z<- z %>% add_count(x, name = 'cx')
z<- z %>% tidyr::separate_rows(y1, sep = ",")
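To illustrate the difference between the separators, here is a small sketch on a hypothetical one-row data frame (`d`, `by_default`, `by_comma` and `by_comma_ws` are made-up names). It also shows one possible reason a row of identical items could survive a later `n_distinct()` filter: as far as I know, `separate_rows()` does not trim whitespace, so splitting on `","` alone can leave a leading space on some pieces. The pattern `",\\s*"` is my own suggestion, not something from the question.

```r
library(tidyr)

# Toy one-row data frame (hypothetical), mimicking the format of the y column
d <- data.frame(id = 1,
                y1 = "UNEQUIPPED-TMe, UNEQUIPPED-TMe,UNEQUIPPED-TMe",
                stringsAsFactors = FALSE)

# Default separator: splits on every run of non-alphanumeric characters
by_default <- separate_rows(d, y1)

# Comma only: pieces keep any whitespace that followed the comma
by_comma <- separate_rows(d, y1, sep = ",")

# Comma plus optional whitespace (assumption: spaces after commas are noise)
by_comma_ws <- separate_rows(d, y1, sep = ",\\s*")

unique(by_comma$y1)     # inspect whether a piece with a leading space appears
unique(by_comma_ws$y1)  # inspect whether the pieces are now identical
```

If the pieces from `sep = ","` do differ only by a leading space, either splitting on `",\\s*"` or applying `trimws()` to the separated column before counting distinct values should make identical items compare as equal.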

Mario Niepel


Yes, it's correct. I fixed it, but the final row is still present in the final output. I mean the comparison and removal part does not run correctly – Isuru Dec 09 '20 at 15:26
Could you update your question then with the edit in place and now clearly indicate what the remaining problem is? – Mario Niepel Dec 09 '20 at 15:42
I have updated the question; can you find the solution? – Isuru Dec 11 '20 at 10:25
