I have a nasty data table that has a couple of different kinds of messiness, and I can't figure out how to combine some of the other answers that use the tidyr and splitstackshape packages.

library(data.table)

subject <- c("A", "B", "C")
review <- c("Bill: [1.0]", "Bill: [2.0], Cathy: [3.0]", "Fred: [4.0], Cathy: [2.0]")
data.table(cbind(subject, review))

which gives:

   subject                    review
1:       A               Bill: [1.0]
2:       B Bill: [2.0], Cathy: [3.0]
3:       C Fred: [4.0], Cathy: [2.0]

This exhibits the kind of messiness that tidyr targets: multiple variables stored in one column, along with some ugly formatting.

What I want is a table like:

subject  Bill  Fred  Cathy
A        1.0   0.0   0.0
B        2.0   0.0   3.0
C        0.0   4.0   2.0
Jaap
bikeclub

4 Answers

This should do it. I recommend inspecting intermediate results to understand the different steps:

# example setup
library(tidyverse)

subject <- c("A", "B", "C")
review <- c("Bill: [1.0]", "Bill: [2.0], Cathy: [3.0]", "Fred: [4.0], Cathy: [2.0]")
dt <- tibble(subject, review)

# solution
dt %>% 
  separate_rows(review, sep = ",\\s*") %>%
  separate(review, c("name", "interval"), sep = ":\\s*") %>%
  mutate(interval = as.numeric(str_replace_all(interval, "\\[|\\]", ""))) %>%
  complete(subject, name) %>%
  replace_na(list(interval = 0)) %>%
  spread(name, interval)
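
As a side note, spread() is superseded in tidyr 1.0.0 and later by pivot_wider(), which can also fill the missing combinations directly. The following is a minimal sketch of the same idea under that assumption, not a replacement for the answer above; the name `wide` and the whitespace-tolerant separators are my own choices.

```r
library(dplyr)
library(tidyr)
library(stringr)

subject <- c("A", "B", "C")
review <- c("Bill: [1.0]", "Bill: [2.0], Cathy: [3.0]", "Fred: [4.0], Cathy: [2.0]")
dt <- tibble(subject, review)

wide <- dt %>%
  # one row per "name: [value]" pair; ",\\s*" also swallows the space after the comma
  separate_rows(review, sep = ",\\s*") %>%
  # split each pair into a name and a bracketed value
  separate(review, c("name", "interval"), sep = ":\\s*") %>%
  # drop the brackets and convert to numeric
  mutate(interval = as.numeric(str_remove_all(interval, "\\[|\\]"))) %>%
  # go wide, filling absent subject/name combinations with 0
  pivot_wider(names_from = name, values_from = interval,
              values_fill = list(interval = 0))

wide
```

Because pivot_wider() fills the gaps itself, the complete() and replace_na() steps are not needed in this variant.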
67342343

Here is an option using data.table

library(data.table)
dcast(dt[, strsplit(review, ", "),  subject][, 
    c('v1', 'v2') := tstrsplit(V1, ":\\s+\\[|\\]")],
       subject ~ v1, value.var = 'v2', fill = 0)
#   subject Bill Cathy Fred
#1:       A  1.0     0    0
#2:       B  2.0   3.0    0
#3:       C    0   2.0  4.0

data

dt <- data.table(subject, review)
akrun

The "splitstackshape" approach would similarly require first splitting to a "long" form, then again to a "wide" form, and then reshaping the data.

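library(data.table)
library(splitstackshape)
library(magrittr)

# assumes `subject` and `review` as defined in the question
DT <- data.table(subject, review)

DT %>% 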
  .[, review := gsub("\\[|\\]", "", review)] %>% 
  cSplit("review", ",", "long") %>% 
  cSplit("review", ":", "wide") %>% 
  dcast(subject ~ review_1, value.var = "review_2", fill = 0)
##    subject Bill Cathy Fred
## 1:       A    1     0    0
## 2:       B    2     3    0
## 3:       C    0     2    4
A5C1D2H2I1M1N2O1R2T1

This may be another way of doing it.

library(data.table)
library(tidyr)
t <- data.table(subject, review)
tmp <- t[, c(text = strsplit(review, " ", fixed = TRUE)), by = subject]
tmp$text <- gsub("[^[:alnum:][:space:].]", "", tmp$text)

subject <- tmp$subject[is.na(extract_numeric(tmp$text))]
col2 <- tmp$text[is.na(extract_numeric(tmp$text))]
col3 <- extract_numeric(tmp$text)[!is.na(extract_numeric(tmp$text))]
# build the data frame directly; cbind() would coerce col3 to character
tmp2 <- data.frame(subject, col2, col3)
library(reshape2)
m <- dcast(tmp2, subject~col2, value.var="col3")
m[is.na(m)] <- 0
alhan