
I have a data frame like this:

v2      v3
1.000   2:3,3:2,5:2,
2.012   1:5,2:4,6:3,

The second column, v3, consists of 'index-value' pairs, with each pair followed by a comma (,).

Within each 'index-value' pair, the number preceding the : is the vector index, and the number after the : is the corresponding value. E.g. in the first row, the vector indices are 2, 3, and 5, and the corresponding values are 3, 2, and 2.

Indices not represented in the string should have the value 0 in the resulting vector.

I wish to convert the 'index-value' vector to a vector of values.

Thus, for the two strings above the expected result is:

v2     v3
1.000  c(0,3,2,0,2,0)
2.012  c(5,4,0,0,0,3)   
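
For anyone who wants to try the answers below, the sample data can be rebuilt roughly as follows. This is only a sketch: the name df and the assumption that v3 is a character column (not a factor) are not stated in the question.

```r
# Sample data from the question.
# Assumption: v3 is stored as character, not factor.
df <- data.frame(
  v2 = c(1.000, 2.012),
  v3 = c("2:3,3:2,5:2,", "1:5,2:4,6:3,"),
  stringsAsFactors = FALSE
)
str(df)
```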
Henrik
user5779223

3 Answers


We use the data.table package only for its tstrsplit function, which saves an intermediate step. Try this:

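library(data.table)  # only needed for tstrsplit()
df$v3 <- lapply(
  # split each string on ",", then split the pairs on ":" into (indices, values)
  lapply(strsplit(as.character(df$v3), ",", fixed = TRUE), tstrsplit, ":"),
  function(x) {
    res <- numeric(6)  # result length is fixed at 6, as in the expected output
    res[as.numeric(x[[1]])] <- as.numeric(x[[2]])
    res
  })
#     v2               v3
#1 1.000      0,3,2,0,2,0
#2 2.012      5,4,0,0,0,3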
  • We first split each element of v3 using the comma (,) as separator;
  • We then split again using the : as separator;
  • We create a numeric vector of length 6;
  • We finally fill in the values according to the described logic.
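
If hard-coding the length is undesirable, the same steps can be sketched in base R with the length taken from the largest index found in the data. This is an illustration under assumptions rather than part of the answer above: the helper to_dense and the variables v3, n, and dense are hypothetical names, and using the overall maximum index as the length is one possible choice.

```r
v3 <- c("2:3,3:2,5:2,", "1:5,2:4,6:3,")

# Largest index anywhere in the data: every run of digits followed by ":".
n <- max(as.integer(unlist(
  regmatches(v3, gregexpr("[0-9]+(?=:)", v3, perl = TRUE))
)))

# Hypothetical helper: expand one "index:value," string to a dense vector.
to_dense <- function(s, n) {
  # split on ",", then split each pair on ":"
  pairs <- strsplit(strsplit(s, ",", fixed = TRUE)[[1]], ":", fixed = TRUE)
  idx <- as.integer(vapply(pairs, `[`, character(1), 1))
  val <- as.numeric(vapply(pairs, `[`, character(1), 2))
  res <- numeric(n)  # zeros for indices that do not appear
  res[idx] <- val
  res
}

dense <- lapply(v3, to_dense, n = n)
dense
```

For the two sample strings, n is 6, so the result has the same shape as the hard-coded version.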
nicola
  • I voted you up here, but I would personally prefer not hard-coding the `numeric(6)` part of the answer. – A5C1D2H2I1M1N2O1R2T1 Jan 16 '16 at 07:50
  • @AnandaMahto I agree with you, but the question gives no indication on this point. I wanted to use the `max` index of each element, but vectors of length 6 appear in the desired result. If the OP jumps in and describes the desired output more precisely, I'll change it accordingly. – nicola Jan 16 '16 at 07:54
  • @nicola Actually it should be a constant value, so your method is OK. Thanks for your reply; I will try it right now. – user5779223 Jan 16 '16 at 08:50
  • @nicola Sorry, I forgot something. There should be a `,` at the end of each row. Does anything need to change? – user5779223 Jan 16 '16 at 08:54
  • I tested it and it works even if there is a comma at the end. – nicola Jan 16 '16 at 08:58
  • @AnandaMahto Yes, it works and the problem has already been solved. Thanks to both of you. – user5779223 Jan 16 '16 at 14:25
  • @nicola But now I have run into another problem. I'd like to convert V3 to a matrix, treat each column of V3 as a vector, and then measure the covariance and correlation between v2 and each vector. But when I do `cov(as.array(data$V2),as.matrix(data$V3) )`, an error occurs: `Error: is.numeric(y) || is.logical(y) is not TRUE`. Do you have any idea why? – user5779223 Jan 17 '16 at 07:15

I would suggest going with an approach like the one suggested by @nicola; however, for fun, here's an alternative.

Use read.dcf, which is used to read "tag:value" type data. To get all the "tags", use the fields argument. You've specified this as 1:6 in your comment to @nicola. Also, you need to replace your "," with newline characters ("\n").

We'll store all of this in a string so that deparse + textConnection will be able to handle it. Not necessary for this example, but just in case....

# named dcf_text rather than str, to avoid masking base R's str() function
dcf_text <- gsub(",", "\n", mydf$v3)
x <- read.dcf(textConnection(dcf_text), fields = as.character(1:6))
x <- replace(x, is.na(x), 0)
x
#      1   2   3   4   5   6  
# [1,] "0" "3" "2" "0" "2" "0"
# [2,] "5" "4" "0" "0" "0" "3"

To get it back in your data.frame as a list of numeric vectors, do this:

mydf$v3_l <- lapply(seq_len(nrow(x)), function(y) as.numeric(x[y, ]))

Here's the resulting str:

str(mydf)
'data.frame':   2 obs. of  3 variables:
 $ v2  : num  1 2.01
 $ v3  : chr  "2:3,3:2,5:2," "1:5,2:4,6:3,"
 $ v3_l:List of 2
  ..$ : num  0 3 2 0 2 0
  ..$ : num  5 4 0 0 0 3
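
If a plain numeric matrix is more convenient than a list column (for example, to pass to cov()), the list elements can be row-bound. A minimal sketch, with the two vectors and v2 typed in by hand so that it runs on its own; the names v3_l, v2, and m are used here only for illustration:

```r
# The two numeric vectors from the list column, entered by hand.
v3_l <- list(c(0, 3, 2, 0, 2, 0), c(5, 4, 0, 0, 0, 3))
v2 <- c(1.000, 2.012)

# Bind the list elements as rows: a 2 x 6 numeric matrix.
m <- do.call(rbind, v3_l)
dim(m)

# Covariance between v2 and each column of m (a 1 x 6 matrix).
cov(v2, m)
```

cov() requires numeric (or logical) input, so it rejects a list column passed as it is; the row-bound matrix m is numeric and is accepted.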
A5C1D2H2I1M1N2O1R2T1
  • But that approach just returns a list of numbers instead of a vector. I tried to convert them to a vector but failed. Do you have any idea? – user5779223 Jan 18 '16 at 08:30

Here's another approach using only base functions.

First, the string is split (strsplit) by : or ,. Elements at odd positions correspond to indices, and elements at even positions to values. We pre-allocate a numeric vector whose length is the maximum index.

In the lapply loop, we assign the values of the split vector (i.e. the even elements; x[c(FALSE, TRUE)]) to the pre-allocated vector vec, at the indices (i.e. the odd elements of the split vector; x[c(TRUE, FALSE)]).

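# Split on ":" or ","; odd positions are indices, even positions are values
l <- strsplit(df$v3, "[:,]")
# Pre-allocate a zero vector as long as the largest index in any row
vec <- numeric(length = max(as.integer(unlist(l)[c(TRUE, FALSE)])))

df$v3 <- lapply(l, function(x) {
  x <- as.numeric(x)
  vec[x[c(TRUE, FALSE)]] <- x[c(FALSE, TRUE)]  # modifies a local copy of vec
  vec
})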

df
#      v2               v3
# 1 1.000 0, 3, 2, 0, 2, 0
# 2 2.012 5, 4, 0, 0, 0, 3
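
To see how the recycled logical index separates indices from values, here is a small stand-alone sketch for the first row only. The vector x is typed in by hand as what the split of "2:3,3:2,5:2," yields, and the length 6 is taken from the expected output.

```r
# Split result for the first row, converted to numeric.
x <- as.numeric(c("2", "3", "3", "2", "5", "2"))

idx <- x[c(TRUE, FALSE)]  # odd positions:  2 3 5
val <- x[c(FALSE, TRUE)]  # even positions: 3 2 2

vec <- numeric(6)         # all zeros
vec[idx] <- val
vec
# [1] 0 3 2 0 2 0
```

The two-element logical vector is recycled along x, so c(TRUE, FALSE) keeps every odd position and c(FALSE, TRUE) keeps every even one.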
Henrik