1

I asked a question which seems to be very difficult to solve. I have been busy with it for few days already. I try to break down the questions to smaller questions so that I might get some help. The original question is here extract information from a data frame

I have a data frame like below

df<- structure(list(s1 = structure(1:3, .Label = c("3-4", "4-1", "5-4"
    ), class = "factor"), s2 = structure(1:3, .Label = c("2-4", "3-15", 
    "7-16"), class = "factor")), .Names = c("s1", "s2"), row.names = c(NA, 
    -3L), class = "data.frame")
Looks like below

In this example I have two columns but the solution should not specific to only two columns

> df
#   s1         s2
#1 3-4        2-4
#2 4-1        3-15
#3 5-4        7-16

I want to count how many times a string after - is repeated and how many times they appear in each column

Lets say if I look at the first column, I see 4, 1, 4 and second column I see 4, 15 and 16 , so I will have 3 times 4 is repeated , once 1 and once 15 and once 16

M repeated 4 3 1 1 15 1 16 1

If I look at which columns they come from these strings 4 comes two times from column 1 and once from column 2 1 comes once from first column (s1) 15 comes once from column (s2) comes once from second column (s2)

so I will have the output like this

M    repeated  COL1  COL2
4     3         2      1
1     1         1      -
15    1         -      1
16    1         -      1

what I was thinking ? thanks to @Arkun, I can melt the df

M1 <- melt(df, id.var=NULL)

The output will be like this

>M1
   #  variable value
   # 1       s1   3-4
   # 2       s1   4-1
   # 3       s1   5-4
   # 4       s2   2-4
   # 5       s2  3-15
   # 6       s2  7-16

Then I split the values based on the hyphen using below

lst <- setNames(strsplit(M1$value, "-"), M1$variable)

now I have the following

>lst
#$s1
#[1] "3" "4"
#$s1
#[1] "4" "1"
#$s1
#[1] "5" "4"
#$s2
#[1] "2" "4"
#$s2
#[1] "3"  "15"
#$s2
#[1] "7"  "16"

Then I don't know how to get further

Community
  • 1
  • 1
nik
  • 2,500
  • 5
  • 21
  • 48
  • 2
    If you load data.table, then `setDT(M1)[, c("pre","suff") := tstrsplit(value,"-")]; M1[, table(suff,variable)]`. Your "repeated" column is just the sum of the others and can be constructed afterwards. – Frank Mar 01 '16 at 17:41
  • 1
    @Frank that is amazing , one part is solved without any loop :-) , please make it as an answer so I can somehow appreciate your comment. Also if it is possible, can you have a look at the original question? I believe you can help me out – nik Mar 01 '16 at 17:49

1 Answers1

3

As explained by @akrun, "melting" the data and splitting on "-" is a good start:

library(data.table)

m = setDT(melt(df, id.var=NULL))
m[, c("pre","suff") := tstrsplit(value, "-", type.convert=TRUE)]

Then you can summarize the data by tabulating:

m[, table(suff, variable)]

#     variable
# suff s1 s2
#   1   1  0
#   4   2  1
#   15  0  1
#   16  0  1

More generally, you can group by values of the suffix and construct summary stats:

summ = m[, .(
  N   = .N,
  pre = paste(sort(unique(pre)), collapse=",")
), keyby=.(suff, variable)]

#    suff variable N pre
# 1:    1       s1 1   4
# 2:    4       s1 2 3,5
# 3:    4       s2 1   2
# 4:   15       s2 1   3
# 5:   16       s2 1   7

To put the variables in their own columns again...

dcast(summ, suff~variable, value.var=c("N","pre"))

#    suff N_s1 N_s2 pre_s1 pre_s2
# 1:    1    1   NA      4     NA
# 2:    4    2    1    3,5      2
# 3:   15   NA    1     NA      3
# 4:   16   NA    1     NA      7

Check out the introductory materials for the data.table package for more on how this works and what else you can do.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • 1
    I liked and accepted your solution. I learned from it and I appreciate it very much . thanks bro – nik Mar 01 '16 at 18:09
  • @Frank Good post. Plus one. – akrun Mar 01 '16 at 18:13
  • @Mol Maybe it will work if you do `m = setDT(melt(df, id.var=NULL))[value != ""]`. Beyond that, I'm out of ideas. – Frank Mar 01 '16 at 19:31
  • @Mol It does work for me. Can you check the `str(df)` – akrun Mar 01 '16 at 20:08
  • @Mol I tried reading it `df <- read.csv("https://gist.githubusercontent.com/anonymous/d51ad485c7a7b5d04b67/raw/9aea92f372b7a5e8ffd152689e0bb2b071c3d6f7/df.txt", header=TRUE, sep="\t", stringsAsFactors=FALSE)` If you don't want the blanks, and NA's you can remove it. – akrun Mar 01 '16 at 20:09