0

I have a set of data:

 name<-c("A","B","C")
 type<-c("U","","D")
 content<-c("ABC (0001 - test), CCC (0002 - test1), DDD (0004 - test 2)", "CCC (0002 - test1)", "N/A")
 data<-data.frame(name,type,content)

It looks like this:

  name type                                                    content
  A    U     ABC (0001 - test), CCC (0002 - test1), DDD (0004 - test 2)
  B                                                  CCC (0002 - test1)
  C    D                                                           N/A

In R, could I transform it to:

 name   type         content              code
 A      U            ABC (0001 - test)    0001
 A      U            CCC (0002 - test1)   0002
 A      U            DDD (0004 - test 2)  0004
 B                   CCC (0002 - test1)   0002
 C      D            N/A                  N/A

I am not sure if I can do this in R? Could anyone help?
Very new to R, so some explanations would be appreciated.

Head and toes
  • 659
  • 3
  • 11
  • 29
  • Highly related, maybe duplicate post [here](https://stackoverflow.com/questions/45849727/how-to-custom-flatten-a-data-frame). – lmo Aug 24 '17 at 15:16
  • You can try `res <- splitstackshape::cSplit(data, "content", direction = "long"); res$code <- stringr::str_extract(res$content, "\\d{4}")` – Steven Beaupré Aug 24 '17 at 15:30

2 Answers2

7

Here's a base R approach:

# split the contents by comma:
x <- strsplit(as.character(data$content), ", ", fixed = T)
# add new rows with each content:
data <- cbind(data[rep(1:nrow(data), lengths(x)), 1:2], content = unlist(x))
# extract and add the code:
data$code <- sub(".*\\((\\d+)\\s.*", "\\1", data$content)

The resulting data.frame looks like this:

data
#     name type             content code
# 1      A    U   ABC (0001 - test) 0001
# 1.1    A    U  CCC (0002 - test1) 0002
# 1.2    A    U DDD (0004 - test 2) 0004
# 2      B       CCC (0002 - test1) 0002
# 3      C    D                 N/A  N/A
talat
  • 68,970
  • 21
  • 126
  • 157
2
library(stringr)
do.call(rbind, lapply(1:NROW(data), function(i)
    setNames(data.frame(data$name[i],
                        data$type[i],
                        unlist(strsplit(as.character(data$content[i]), ", ")),
                        str_extract(unlist(strsplit(as.character(data$content[i]), ", ")),
                                    "\\d{4}")),
             c(names(data), "Code"))))
#  name type             content Code
#1    A    U   ABC (0001 - test) 0001
#2    A    U  CCC (0002 - test1) 0002
#3    A    U DDD (0004 - test 2) 0004
#4    B       CCC (0002 - test1) 0002
#5    C    D                 N/A <NA>
d.b
  • 32,245
  • 6
  • 36
  • 77
  • 1
    Great solution. – Sagar Aug 24 '17 at 15:12
  • Amazing. Thanks! I don't quite understand what the setName function is doing? – Head and toes Aug 24 '17 at 15:19
  • @Headandtoes, `setNames`: sets the column names of the final `data.frame` to `c(names(data), "Code")` – d.b Aug 24 '17 at 15:24
  • One question - if the Code can be more than 4 digits, say it can be 0001A, or 0001D (some are still 4 digits), I assume I can't use \\d{4} ? What could i do instead? – Head and toes Aug 24 '17 at 15:40
  • @Headandtoes, something like `"(\\d{4})(\\D{1})"` might work. But I'm not so good with `regex` so if you have more complicated patterns, you may want to ask another question – d.b Aug 24 '17 at 15:45