36

What I am trying to accomplish is splitting a column into multiple columns. I would prefer the first column to contain "F", second column "US", third "CA6" or "DL", and the fourth to be "Z13" or "U13" etc etc. My entire df follows the same pattern of X.XX.XXXX.XXX or X.XX.XXX.XXX or X.XX.XX.XXX and I know the third column is where my problem lies because of the different lengths. I have only used substr in the past and I could use that here with some if statements but would like to learn how to use stringr package and POSIX to do this (unless there is a better option). Thank you in advance.

Here is my df:

c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
"F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
"F.US.DL.U13", "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.Z13", "F.US.DL.Z13"
)
smci
  • 32,567
  • 20
  • 113
  • 146
Tim
  • 776
  • 3
  • 8
  • 15

4 Answers4

57

A very direct way is to just use read.table on your character vector:

> read.table(text = text, sep = ".", colClasses = "character")
   V1 V2  V3  V4
1   F US CLE V13
2   F US CA6 U13
3   F US CA6 U13
4   F US CA6 U13
5   F US CA6 U13
6   F US CA6 U13
7   F US CA6 U13
8   F US CA6 U13
9   F US  DL U13
10  F US  DL U13
11  F US  DL U13
12  F US  DL Z13
13  F US  DL Z13

colClasses needs to be specified, otherwise F gets converted to FALSE (which is something I need to fix in "splitstackshape", otherwise I would have recommended that :) )


Update (> a year later)...

Alternatively, you can use my cSplit function, like this:

cSplit(as.data.table(text), "text", ".")
#     text_1 text_2 text_3 text_4
#  1:      F     US    CLE    V13
#  2:      F     US    CA6    U13
#  3:      F     US    CA6    U13
#  4:      F     US    CA6    U13
#  5:      F     US    CA6    U13
#  6:      F     US    CA6    U13
#  7:      F     US    CA6    U13
#  8:      F     US    CA6    U13
#  9:      F     US     DL    U13
# 10:      F     US     DL    U13
# 11:      F     US     DL    U13
# 12:      F     US     DL    Z13
# 13:      F     US     DL    Z13

Or, separate from "tidyr", like this:

library(dplyr)
library(tidyr)

as.data.frame(text) %>% separate(text, into = paste("V", 1:4, sep = "_"))
#    V_1 V_2 V_3 V_4
# 1    F  US CLE V13
# 2    F  US CA6 U13
# 3    F  US CA6 U13
# 4    F  US CA6 U13
# 5    F  US CA6 U13
# 6    F  US CA6 U13
# 7    F  US CA6 U13
# 8    F  US CA6 U13
# 9    F  US  DL U13
# 10   F  US  DL U13
# 11   F  US  DL U13
# 12   F  US  DL Z13
# 13   F  US  DL Z13
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
18

Is this what you are trying to do?

# Our data
text <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
"F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
"F.US.DL.U13", "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.Z13", "F.US.DL.Z13"
)

#  Split into individual elements by the '.' character
#  Remember to escape it, because '.' by itself matches any single character
elems <- unlist( strsplit( text , "\\." ) )

#  We know the dataframe should have 4 columns, so make a matrix
m <- matrix( elems , ncol = 4 , byrow = TRUE )

#  Coerce to data.frame - head() is just to illustrate the top portion
head( as.data.frame( m ) )
#  V1 V2  V3  V4
#1  F US CLE V13
#2  F US CA6 U13
#3  F US CA6 U13
#4  F US CA6 U13
#5  F US CA6 U13
#6  F US CA6 U13
Simon O'Hanlon
  • 58,647
  • 14
  • 142
  • 184
  • This was what I was looking for! And as always with R there are so many ways to get there. Thank you for your help. Just use the merge command to merge the new df back into my main df? – Tim Sep 05 '13 at 17:15
  • @Tim it's a bit hard to say without seeing the structure of your data.frame, but you could give a go and see if it works? Otherwise edit your question pasting in the output from `dput( head( df ) )` for a more reliable answer! :-) – Simon O'Hanlon Sep 05 '13 at 17:17
  • all I needed was `unlist()`. thanks – Amit Kohli Nov 19 '14 at 18:38
  • 1
    Thanks for commenting your code. The comment "Remember to escape it, because '.' by itself matches any single character" helped me fix my issue. Thanks a lot :) – Sriram Nov 30 '16 at 22:28
9

The way via unlist and matrix seems a bit convoluted, and requires you to hard-code the number of elements (this is actually a pretty big no-go. Of course you could circumvent hard-coding that number and determine it at run-time)

I would go a different route, and construct a data frame directly from the list that strsplit returns. For me, this is conceptually simpler. There are essentially two ways of doing this:

  1. as.data.frame – but since the list is exactly the wrong way round (we have a list of rows rather than a list of columns) we have to transpose the result. We also clear the rownames since they are ugly by default (but that’s strictly unnecessary!):

    `rownames<-`(t(as.data.frame(strsplit(text, '\\.'))), NULL)
    
  2. Alternatively, use rbind to construct a data frame from the list of rows. We use do.call to call rbind with all the rows as separate arguments:

    do.call(rbind, strsplit(text, '\\.'))
    

Both ways yield the same result:

     [,1] [,2] [,3]  [,4]
[1,] "F"  "US" "CLE" "V13"
[2,] "F"  "US" "CA6" "U13"
[3,] "F"  "US" "CA6" "U13"
[4,] "F"  "US" "CA6" "U13"
[5,] "F"  "US" "CA6" "U13"
[6,] "F"  "US" "CA6" "U13"
…

Clearly, the second way is much simpler than the first.

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
  • *and requires you to hard-code the number of elements (this is actually a pretty big no-go.* but the number of columns was given and specified in the OP so I don't really have a problem with it. Nice alternatives though. +1 – Simon O'Hanlon Sep 05 '13 at 17:25
  • 1
    In fact I *really* like `do.call(rbind, strsplit(text, '\\.'))` – Simon O'Hanlon Sep 05 '13 at 17:27
1

We could use tidyr::extract()

x <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
  "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
  "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.Z13", "F.US.DL.Z13"
)


library(tidyr)
extract(tibble(data=x),"data", regex = "^(.*?)\\.(.*?)\\.(.*?)\\.(.*?)$",into = LETTERS[1:4])
#> # A tibble: 13 x 4
#>    A     B     C     D    
#>    <chr> <chr> <chr> <chr>
#>  1 F     US    CLE   V13  
#>  2 F     US    CA6   U13  
#>  3 F     US    CA6   U13  
#>  4 F     US    CA6   U13  
#>  5 F     US    CA6   U13  
#>  6 F     US    CA6   U13  
#>  7 F     US    CA6   U13  
#>  8 F     US    CA6   U13  
#>  9 F     US    DL    U13  
#> 10 F     US    DL    U13  
#> 11 F     US    DL    U13  
#> 12 F     US    DL    Z13  
#> 13 F     US    DL    Z13

Another option is to use unglue::unglue_data()

# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_data(x,"{A}.{B}.{C}.{D}")
#>    A  B   C   D
#> 1  F US CLE V13
#> 2  F US CA6 U13
#> 3  F US CA6 U13
#> 4  F US CA6 U13
#> 5  F US CA6 U13
#> 6  F US CA6 U13
#> 7  F US CA6 U13
#> 8  F US CA6 U13
#> 9  F US  DL U13
#> 10 F US  DL U13
#> 11 F US  DL U13
#> 12 F US  DL Z13
#> 13 F US  DL Z13

Created on 2019-09-14 by the reprex package (v0.3.0)

moodymudskipper
  • 46,417
  • 11
  • 121
  • 167