find how many times and which columns a string is repeated

Question

My data consists of 6 strings per each element. It has string with 6 characters. The data has white space too. I want to know how many times each string is repeated in all columns

for example P67809 is repeated 2 times in column a and column d so the output should look likes

string   No     columns 
P67809   2       a,b

Based on this function I can assign a row number to each string

normalize <- function(x, delim) {
    x <- gsub(")", "", x, fixed=TRUE)
    x <- gsub("(", "", x, fixed=TRUE)
    idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
    names <- unlist(strsplit(as.character(x), delim))
    return(setNames(idx, names))
}

Then I apply the function on all and each columns string like

myS <- lapply(mydata, normalize,";")

but I don't know how to then search and get the output

akrun · Accepted Answer · 2016-02-28T20:04:10.137

2

We could melt the data from 'wide' to 'long' format. Split the 'value' column with ; to get a list output. We set the names of the list as the 'variable' column of 'dM'. Then stack the list to a two column output, and get the frequency count with 'tbl'. It may be easier to understand the result from the 'tbl' output.

library(reshape2)
dM <- melt(mydata, id.var=NULL) 
lst1 <- setNames(strsplit(dM$value, ";"), dM$variable)
tbl <- table(stack(lst1)[2:1])
tbl
values
#ind   A4QPH2 O60814 P0CG47 P0CG48 P14923 P15924 P19338 P35908 P42356 P57053 P58876 P62750 P62807 P62851 P62979 P63241 P67809 Q02413 Q06830 Q07955 Q16658 Q5QNW6 Q6IS14 Q8N8J0 Q93079 Q969S3
#  a 0      1      0      0      0      1      1      1      1      1      0      0      0      0      0      0      0      1      0      0      0      0      0      0      1      0      0
#  b 3      0      0      1      1      0      0      0      0      0      0      0      0      0      0      1      1      0      1      0      0      0      0      1      0      0      0
#  c 1      0      0      1      1      0      0      0      0      0      0      0      0      0      1      1      1      0      1      0      0      1      0      1      0      0      0
#  d 0      0      1      1      1      0      0      0      0      0      1      1      1      1      0      1      0      1      0      1      1      0      1      0      0      1      1
#   values
#ind Q99877 Q99879 Q9Y2T7
#  a      0      0      1
#  b      0      0      0
#  c      0      0      0
#  d      1      1      1

We get the total number of each element with colSums.

cS <- colSums(tbl)

If we need to get the output as in the OP's post, we can melt the list output to create a 2 column data.frame. From this, we convert to 'data.table' (setDT(), grouped by 'value' column , we get the length of unique elements of 'variable' and also paste together the unique elements.

library(data.table)
res <- setDT(melt(lst1))[, list(No= uniqueN(L1),
      columns= toString(unique(L1))) ,.(string=value)]
head(res,2)
#    string No columns
#1: P67809  2    a, d
#2: Q9Y2T7  2    a, d

edited Feb 28 '16 at 20:04

answered Feb 28 '16 at 19:37

akrun

874,273
37
540
662

@Mol Sure I will add the description now – akrun Feb 28 '16 at 19:59
I got a warning using melt Warning message: attributes are not identical across measure variables; they will be dropped . it means some of the strings which are not repeated will be discard ? – nik Feb 28 '16 at 20:02
@Mol It is because you have `factor` columns. Nothing to be worried about. – akrun Feb 28 '16 at 20:04
1

@Mol Thank you for the feedback. Glad it helped you. – akrun Feb 28 '16 at 20:42
do you know which functions I must use to remove those strings which appears more than 10 times from the original data (mydata) ? – nik Feb 28 '16 at 20:47
1

@Mol Try `v1 <- colSums(tbl) >10; v2 <- trimws(names(v1)[v1]); v3 <- v2[nzchar(v2)]; lapply(mydata, gsub, pattern = paste(v3, collapse="|"), replacement="")` – akrun Feb 28 '16 at 20:54
How can I thank you ? it replaces ; instead them. what should I do to replace nothing? Also I used as.data.frame(lapply(mydata, gsub, pattern = paste(v3, collapse="|"), replacement="")) to get the output as a data frame, is it correct ? – nik Feb 28 '16 at 21:02
1

@Mol It is because the delimiter was there. Perhaps `mydata[] <- lapply(mydata, function(x) gsub(";(?=;)|(?<=;);", "", gsub(paste(v3, collapse="|"), "", x), perl=TRUE))` – akrun Feb 28 '16 at 21:06
thanks , still some columns ; remains , probably it is always some remaining. I will do that buy xls and manually . Again thanks for all your help – nik Feb 28 '16 at 21:22
@Mol Perhaps `gsub(";(?=;)|(?<=;);|^;|;$", "", gsub(...` – akrun Feb 29 '16 at 03:37
can I use the above code when my data is not strings but values ? actually I used it and I saw it works but I just want to be sure if it is OK !? – nik Feb 29 '16 at 20:05
@Mol `gsub` takes numbers as well. For example `sub("2", "", c(23, 42))` – akrun Feb 29 '16 at 20:06
@Mol It's too late here. Signing off for today. – akrun Feb 29 '16 at 21:36
do you think today you can have a look at my question ? I have been trying my best but it made me crazy with no success !!! I could not get it done . I will really appreciate if you could help me. http://stackoverflow.com/questions/35707323/how-to-rearrange-an-order-of-matches-between-two-data-frames – nik Mar 01 '16 at 12:57

DatamineR · Answer 2 · 2016-02-28T20:04:06.987

One approach might be:

res <- apply(mydata, 2, function(x) unlist(strsplit(x, ";")))
un <- unique(unlist(res))
res2 <- sapply(un, function(x) lapply(res, function(y) as.numeric(x %in% y)))

res2

  P67809 Q9Y2T7 P42356 Q8N8J0 A4QPH2 P35908 P19338 P15924 P14923 Q02413 P63241 Q6IS14
a 1      1      1      1      1      1      1      1      1      0      0      0     
b 0      0      0      0      0      0      0      0      0      1      1      1     
c 0      0      0      0      0      0      0      0      0      1      1      1     
d 1      1      0      0      0      0      0      0      0      0      0      0     
  P62979 P0CG47 P0CG48   Q16658 P62851 Q07955 Q06830 P62807 O60814 P57053 Q99879 Q99877
a 0      0      0      0 0      0      0      0      0      0      0      0      0     
b 1      1      1      1 0      0      0      0      0      0      0      0      0     
c 1      1      1      1 1      1      0      0      0      0      0      0      0     
d 1      1      1      0 0      0      1      1      1      1      1      1      1     
  Q93079 Q5QNW6 P58876 P62750 Q969S3
a 0      0      0      0      0     
b 0      0      0      0      0     
c 0      0      0      0      0     
d 1      1      1      1      1   

as.data.frame(t(apply(t(res2), 1, function(x) cbind(sum(as.numeric(x)), paste(names(x)[which(as.logical(x))], collapse = ",")))))
       V1    V2
P67809  2   a,d
Q9Y2T7  2   a,d
P42356  1     a
Q8N8J0  1     a
A4QPH2  1     a
P35908  1     a
P19338  1     a
P15924  1     a
P14923  1     a
Q02413  2   b,c
P63241  2   b,c
Q6IS14  2   b,c
P62979  3 b,c,d
P0CG47  3 b,c,d
P0CG48  3 b,c,d
        2   b,c
Q16658  1     c
P62851  1     c
Q07955  1     d
Q06830  1     d
P62807  1     d
O60814  1     d
P57053  1     d
Q99879  1     d
Q99877  1     d
Q93079  1     d
Q5QNW6  1     d
P58876  1     d
P62750  1     d
Q969S3  1     d

Thanks , I already liked your answer. I will check your answer too — nik, Feb 28 '16 at 20:27
I have asked a question which no one answers and i have been busy to solve it desperately with no success, I will appreciate if you could have a look at it http://stackoverflow.com/questions/35707323/how-to-rearrange-an-order-of-matches-between-two-data-frames — nik, Mar 01 '16 at 15:01

mtoto · Answer 3 · 2016-02-28T20:10:53.300

1

An alternative approach with cSplit from splitstackshape and gather from tidyr.

library(splitstackshape)
library(tidyr)
library(dplyr)

splitted <- cSplit(mydata, splitCols = names(mydata), sep = ";") %>% gather() # Split cols and melt data
splitted$key <- substring(splitted$key, 1, 1) # Lose irrelevant string
table(splitted) # Generate frequency table

edited Feb 28 '16 at 20:10

answered Feb 28 '16 at 20:02

mtoto

23,919
4
58
71

Thanks , I already liked your answer. I will check your answer too – nik Feb 28 '16 at 20:27
I have asked a question which no one answers and i have been busy to solve it desperately with no success, I will appreciate if you could have a look at it http://stackoverflow.com/questions/35707323/how-to-rearrange-an-order-of-matches-between-two-data-frames – nik Mar 01 '16 at 15:00

find how many times and which columns a string is repeated

3 Answers3