0

I have table with an unequal number of elements in each row, with each element having a count of 1 or 2 appended to a string. I want to create a matrix of presence/absence of each string, but including the count (1,2) and placing a zero if the string is not found.

From this:

  V1      V2      V3         V4      V5
1  A   cat:2   dog:1    mouse:1 horse:2
2  B   dog:2 mouse:2 dolphin:2        
3  C horse:2                           
4  D   cat:1 mouse:2  dolphin:2   

To this:

  cat dog mouse horse dolphin
A 2 1 1 2 0
B 0 2 2 0 2
C 0 0 0 2 0
D 1 0 2 0 2

I have looked up previous solutions to similar problems: Convert a dataframe to presence absence matrix

put they create a 0/1 matrix of absence, not including the count.

sample data:

structure(list(V1 = c("A", "B", "C", "D"),
               V2 = c("cat:2", "dog:2", "horse:2", "cat:1"),
               V3 = c("dog:1", "mouse:2", "", "mouse:2"),
               V4 = c("mouse:1", "dolphin:2", "", "dolphin:2"),
               V5 = c("horse:2", "", "", "")),
               .Names = c("V1", "V2", "V3", "V4", "V5"),
               class = "data.frame", row.names = c(NA, -4L))
7mood7
  • 51
  • 3

2 Answers2

2

Maybe some package could make this easier, but here is a solution. It won't be fast for large data, but it does the job:

#split the strings
tmp <- apply(DF[,-1], 1, strsplit, ":")

#extract the first strings
names <- lapply(tmp,function(x)  c(na.omit(sapply(x, "[", 1))))
uniquenames <- unique(unlist(names))

#extract the numbers
reps <- lapply(tmp,function(x)  as.numeric(na.omit(sapply(x, "[", 2))))

#make the numbers named vectors
res <- mapply(setNames, reps, names)

#subset the named vectors and combine result in a matrix
res <- do.call(rbind, lapply(res, "[",uniquenames))

#cosmetics
colnames(res) <- uniquenames
rownames(res) <- DF$V1
res[is.na(res)] <- 0
#  cat dog mouse horse dolphin
#A   2   1     1     2       0
#B   0   2     2     0       2
#C   0   0     0     2       0
#D   1   0     2     0       2
Roland
  • 127,288
  • 10
  • 191
  • 288
1

You can separate the animals from the counts with separate from tidyr right after melting the data into long format and then dcasting to wide using the counts as values (which need to be casted from character to numeric as a previous step).

data %>% 
  melt("V1") %>% 
  separate(value, c("animal", "count"), ":", fill = "left") %>%  
  transform(count = as.numeric(count)) %>% 
  dcast(V1 ~ animal, value.var = "count", fun.aggregate = sum) %>% 
  select(-"NA")

#   V1 cat dog dolphin horse mouse
# 1  A   2   1       0     2     1
# 2  B   0   2       2     0     2
# 3  C   0   0       0     2     0
# 4  D   1   0       2     0     2
Flavia
  • 298
  • 1
  • 6