4

I am using RStudio 2.15.0 and have created an object from Excel using XLConnect with 3000+ rows and 12 columns I am trying to delimit/split a column into the rows but don't know if this is possible or how to do it. Example of the data below using the 3 columns in connection. any help on this would be grand.

Code that is working for 2 of the columns is below.

v1 <- with(df, tapply(PolId, Description,  FUN= function(x) {
x1 <- paste(x, collapse=";")
gsub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*);', '',     x1, perl=TRUE)}))
library(stringr)
Description <- rep(names(v1),  str_count(v1, '\\w+'))
PolId <- scan(text=gsub(';+', ' ', v1), what='', quiet=TRUE)
data.frame(PolId, Description)  

Sample data

PolId   Description  Document.Type
ABC123;ABC456;ABC789;   TEST1  Pol1
ABC123;ABC456;ABC789;   TEST1  Pol1
ABC123;ABC456;ABC789;   TEST1  Pol1
AAA123; TEST1  End1
AAA123; TEST2  End2
ABB123;ABC123;  TEST3  End1
ABB123;ABC123;  TEST3  End1

I want the output to be like this (replacing the duplicate Polid's)

PolId   Description  Document.Type
ABC123  TEST1        Pol1
ABC456  TEST1        Pol1
ABC789  TEST1        Pol1
AAA123  TEST1        End1
AAA123  TEST2        End2
ABB123  TEST3        End1
ABC123  TEST3        End1
New2Programming
  • 351
  • 1
  • 4
  • 17

3 Answers3

7

Here is a base R solution. Split the PolId field using strplit and for each such split field cbind it with the corresponding Description. This gives a list of matrices which we rbind together. Finally set the column names.

out <- do.call(rbind, Map(cbind, strsplit(DF$PolId, ";"), DF$Description))
colnames(out) <- colnames(DF)

giving:

> out
      PolId    Description
 [1,] "ABC123" "TEST1"    
 [2,] "ABC456" "TEST1"    
 [3,] "ABC789" "TEST1"    
 [4,] "ABC123" "TEST1"    
 [5,] "ABC456" "TEST1"    
 [6,] "ABC789" "TEST1"    
 [7,] "ABC123" "TEST1"    
 [8,] "ABC456" "TEST1"    
 [9,] "ABC789" "TEST1"    
[10,] "AAA123" "TEST1"    
[11,] "AAA123" "TEST2"    
[12,] "ABB123" "TEST3"    
[13,] "ABC123" "TEST3"    
[14,] "ABB123" "TEST3"    
[15,] "ABC123" "TEST3" 

Note: We used this as the input:

DF <-
structure(list(PolId = c("ABC123;ABC456;ABC789;", "ABC123;ABC456;ABC789;", 
"ABC123;ABC456;ABC789;", "AAA123;", "AAA123;", "ABB123;ABC123;", 
"ABB123;ABC123;"), Description = c("TEST1", "TEST1", "TEST1", 
"TEST1", "TEST2", "TEST3", "TEST3")), .Names = c("PolId", "Description"
), class = "data.frame", row.names = c(NA, -7L))
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Thank you for this although I am getting the following. > out <- do.call(rbind, Map(cbind, strsplit(df$PolId, ";"), df$Description)) > colnames(out) <- colnames(df) Error in `colnames<-`(`*tmp*`, value = c("Document.Title", "Document.Type", : length of 'dimnames' [2] not equal to array extent – New2Programming Feb 25 '15 at 15:25
  • 1
    Try it with the input used in the answer. If that works for you and does not work with your actual input then determine how your actual input is different from the input that was used. In general, place the output of `dput(DF)` in your question to make it unambiguous. – G. Grothendieck Feb 25 '15 at 16:00
5

Here's a quick data.table possible solution

library(data.table)
unique(setDT(df)[, .(PolId = unlist(strsplit(as.character(PolId), ";"))), by = Description])
#    Description  PolId
# 1:       TEST1 ABC123
# 2:       TEST1 ABC456
# 3:       TEST1 ABC789
# 4:       TEST1 AAA123
# 5:       TEST2 AAA123
# 6:       TEST3 ABB123
# 7:       TEST3 ABC123

Per your edit- Another option (in case you have more than two columns)

library(splitstackshape)
unique(cSplit(df, "PolId", ";", "long"))
#     PolId Description Document.Type
# 1: ABC123       TEST1          Pol1
# 2: ABC456       TEST1          Pol1
# 3: ABC789       TEST1          Pol1
# 4: AAA123       TEST1          End1
# 5: AAA123       TEST2          End2
# 6: ABB123       TEST3          End1
# 7: ABC123       TEST3          End1
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
3

You could try unnest from tidyr after splitting the "PolId" column and get the unique rows

library(dplyr)
library(tidyr)
 unnest(setNames(strsplit(df$PolId, ';'), df$Description), 
                                  Description) %>% unique()

Or using base R with stack/strsplit/duplicated. Split the "PolId" (strsplit) by the delimiter(;), name the output list elements with "Description" column, stack the list to get a 'data.frame' and use duplicated to remove the duplicate rows.

df1 <- stack(setNames(strsplit(df$PolId, ';'), df$Description))
setNames(df1[!duplicated(df1),], names(df))
#     PolId Description
#1  ABC123       TEST1
#2  ABC456       TEST1
#3  ABC789       TEST1
#10 AAA123       TEST1
#11 AAA123       TEST2
#12 ABB123       TEST3
#13 ABC123       TEST3

Or another option without using strsplit

v1 <- with(df, tapply(PolId, Description, FUN= function(x) {
            x1 <- paste(x, collapse=";")
        gsub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*);', '', x1, perl=TRUE)}))
library(stringr)
Description <- rep(names(v1),  str_count(v1, '\\w+'))
PolId <- scan(text=gsub(';+', ' ', v1), what='', quiet=TRUE)
data.frame(PolId, Description)
#   PolId Description
#1 ABC123       TEST1
#2 ABC456       TEST1
#3 ABC789       TEST1
#4 AAA123       TEST1
#5 AAA123       TEST2
#6 ABB123       TEST3
#7 ABC123       TEST3
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Apologies but I don't understand the gsub code but this works. If I wanted to add an additional column to the code how would I do this? – New2Programming Feb 26 '15 at 11:45
  • Could you update the question with the additional column, in that it will be clear. The `gsub` keeps only the unique values and discard the duplicates – akrun Feb 26 '15 at 12:30
  • 1
    or this variation: `df %>% mutate(PolId = strsplit(PolId, ";")) %>% unnest()` – G. Grothendieck Sep 24 '16 at 11:47