-1

Please note I read a similar question about splitting a column of a data frame to multiple columns, but my case is different.

My form of data is as follows:

  name    description      
1 a       hello|hello again|something 
2 b       hello again|something|hello
3 c       hello again|hello
4 d

I'd like to split the description column as follows:

  name    description_1 description_2 description_3
1 a       hello         hello again   something 
2 b       hello         hello again   something 
3 c       hello         hello again   N/A
4 d       N/A           N/A           N/A

Any suggestions, directions?

EDIT: Following @akrun and @Sotos answers (thanks!), here's a more accurate presentation of my data:

  name    description      
1 a       add words|change|approximate 
2 b       control|access|approximate
4 d

therefore, sorting the data alphabetically results in:

  name    description_1    description_2 description_3
1 a       add words        approximate   change 
2 b       access           approximate   control 
4 d       N/A           N/A           N/A

while, what I need is:

  name    desc_1      desc_2       desc_3   desc_4   desc_5
1 a       add words   approximate  change   N/A      N/A
2 b       N/A         approximate  N/A      control  access 
4 d       N/A         N/A          N/A      N/A      N/A

I don't mind how description is sorted (if at all), as long as at each column (desc_1..5) I will have the same description. Hopefully, that clarifies my problem.

Community
  • 1
  • 1
Oshrat
  • 195
  • 2
  • 2
  • 10
  • Could you please explain the rules for getting the expected output so that others won't have to think in bizzarre directions to get the expected output. – akrun Aug 22 '16 at 07:49
  • You haven't replied about the `sort`ing thing. What is the logic? – akrun Aug 22 '16 at 07:56
  • Yes, the question was not well defined, which is probably the reason for not succeeding yet. I'm editing the question. – Oshrat Aug 22 '16 at 08:39
  • Your new expected output is not clear with respect to the logic followed for placing the `control`, `access` in the 4th and 5h columns in the expected output – akrun Aug 22 '16 at 09:38

3 Answers3

4

We can use match to change the order based on the order of the first entry of description, and then split using cSplit from splitstackshape package,

library(splitstackshape)
#make sure column 'description' is a character
df$description <- as.character(df$description)

ind <- strsplit(df$description, '\\|')[[1]]
df$description <- sapply(strsplit(df$description, '\\|'), function(i) 
                                         paste(i[order(match(i, ind))], collapse = '|'))

cSplit(df, 'description', sep = '|', 'wide')
#   name description_1 description_2 description_3
#1:    a         hello   hello_again     something
#2:    b         hello   hello_again     something
#3:    c         hello   hello_again            NA
#4:    d            NA            NA            NA
Sotos
  • 51,121
  • 6
  • 32
  • 66
3

We can use base R by splitting the 'description' column by '|' (NOTE: If the 'description' is factor class, use strsplit(as.character(df1$description), ...)) into a list, sort it, then pad NA at the end for those list elements that have less length than the maximum length of the list element, and cbind with the first column of 'df1'.

lst <- lapply(strsplit(df1$description, "|", fixed = TRUE), sort)
d1 <- setNames(do.call(rbind.data.frame, lapply(lst, `length<-` 
                 ,max(lengths(lst)))), paste0("description_", 1:3))
cbind(df1[1], d1)
#   name description_1 description_2 description_3
#1    a         hello   hello again     something
#2    b         hello   hello again     something
#3    c         hello   hello again          <NA>
#4    d          <NA>          <NA>          <NA>

EDIT: Based on @thelatemail's comments

We can also create a factor with levels specified

lvls <- sort(unique(unlist(lst)))
lst <- lapply(lst, function(x) x[order(factor(x, levels = lvls))]) 

and then use the same code as the 'd1' in the above.


Another option is cSplit to split the 'description' column and reshape it to 'long' format, then sort, dcast it to 'wide' and join with the original dataset on the 'name'

library(splitstackshape)
dcast(cSplit(df1, "description", "|", "long")[, sort(description) , by = name], 
  name ~  paste0("description_", rowid(name)), value.var = "V1")[df1[-2], on = "name"]
#   name description_1 description_2 description_3
#1:    a         hello   hello again     something
#2:    b         hello   hello again     something
#3:    c         hello   hello again            NA
#4:    d            NA            NA            NA

Also, with hadleyverse using separate_rows/spread

library(tidyr)
library(dplyr)
separate_rows(df1, description, sep="[|]") %>%
         arrange(name, description) %>% 
         group_by(name) %>% 
         mutate(Seq = paste0("description_", row_number()) ) %>% 
         spread(Seq, description)
#  name description_1 description_2 description_3
#  <chr>         <chr>         <chr>         <chr>
#1     a         hello   hello again     something
#2     b         hello   hello again     something
#3     c         hello   hello again          <NA>
#4     d                        <NA>          <NA>

Update

Regarding the new data in the OP's post, it is not clear about the sorting. But, as the OP mentioned that it doesn't really matter and what matters is the number of 'description" columns

lst <- strsplit(df2$description, "|", fixed = TRUE)
lvls <- sort(unique(unlist(lst)))
d1 <- setNames(do.call(rbind.data.frame, lapply(lst, function(x)  
      ifelse(lvls %in% x, lvls, NA))), paste0("description_", 1:5))
cbind(df2[1], d1) 
# name description_1 description_2 description_3 description_4 description_5
#1    a          <NA>     add words   approximate        change          <NA>
#2    b        access          <NA>   approximate          <NA>       control
#4    d          <NA>          <NA>          <NA>          <NA>          <NA>
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Why is the logic incorrect? Because you assumed that the have to be sorted alphabetically rather than based on a single entry? – Sotos Aug 22 '16 at 07:42
  • @Sotos There is no where in the OP's post it is mentioned that it should be based on the first entry but the `sort` logic is at least more general than what you are doing. – akrun Aug 22 '16 at 07:43
  • Neither that it is based on sorting which makes the assumption that the order of the words is always going to be alphabetically. It is not clear by the OP but we both made an assumption in order to solve this problem. I don't see why my logic is the wrong one. – Sotos Aug 22 '16 at 07:45
  • That's what I said. We BOTH assumed things so why is my assumption the wrong one? – Sotos Aug 22 '16 at 07:49
  • 1
    @Sotos I guess in statistics, most of the things are based on probablility and likelihood. THe likelihood of your logic getting right is less than a sort thing, but that doesn't mean that it could be wrong in the end. – akrun Aug 22 '16 at 07:51
  • 1
    Well, I have a lot of work so let's agree to disagree on this one. – Sotos Aug 22 '16 at 07:53
2

You probably know what format you want, but two alternative representations suggest themselves. The first is a 'long' data frame that describes the 'name' and associated terms,

terms = strsplit(as.character(df$description), "|", fixed=TRUE)
data.frame(
    name = rep(df$name, lengths(terms)),
    term = unlist(terms))

The second is an 'incidence matrix' that has rows and columns corresponding to names and terms, with TRUE values indicating that the particular term occurs in the particular row

term = unlist(terms)
m = matrix(
    FALSE, nrow(df), length(unique(term)),
    dimnames=list(df$name, unique(term)))
idx = cbind(    # a two-column matrix can be used as an 'index' into another matrix
    rep(as.character(df$name), lengths(terms)),
    term)
m[idx] = TRUE

For instance (it is much easier to answer questions when a simple example is provided in a way that can be cut-and-pasted into an R session) with

df = data.frame(
    name=c("a", "b", "c"),
    description=c(
        "add words|change|approximate",
        "control|access|approximate",
        ""))

we have

>     data.frame(
+         name = rep(df$name, lengths(terms)),
+         term = unlist(terms))
  name        term
1    a   add words
2    a      change
3    a approximate
4    b     control
5    b      access
6    b approximate

and

> m
  add words change approximate control access
a      TRUE   TRUE        TRUE   FALSE  FALSE
b     FALSE  FALSE        TRUE    TRUE   TRUE
c     FALSE  FALSE       FALSE   FALSE  FALSE

The 'long' data frame is appropriate for sparse data (many terms, each in only a few rows), the matrix representation for denser data. m can be bound to the original data frame, cbind(df, m), if desired.

Martin Morgan
  • 45,935
  • 7
  • 84
  • 112
  • Thanks, that worked perfect, the 'incidence matrix' presentation! and I'll take your comment of how to present a question better next time. – Oshrat Aug 23 '16 at 06:37