Splitting a dataframe string column into multiple different columns

Question

What I am trying to accomplish is splitting a column into multiple columns. I would prefer the first column to contain "F", the second "US", the third "CA6" or "DL", and the fourth "Z13" or "U13", and so on. My entire df follows the same pattern of X.XX.XXXX.XXX or X.XX.XXX.XXX or X.XX.XX.XXX, and I know the third column is where my problem lies because of the different lengths. I have only used substr in the past, and I could use that here with some if statements, but I would like to learn how to use the stringr package and POSIX to do this (unless there is a better option). Thank you in advance.

Here is my df:

c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
"F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
"F.US.DL.U13", "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.Z13", "F.US.DL.Z13"
)

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2014-10-12T16:26:55.387

57

A very direct way is to just use read.table on your character vector:

> read.table(text = text, sep = ".", colClasses = "character")
   V1 V2  V3  V4
1   F US CLE V13
2   F US CA6 U13
3   F US CA6 U13
4   F US CA6 U13
5   F US CA6 U13
6   F US CA6 U13
7   F US CA6 U13
8   F US CA6 U13
9   F US  DL U13
10  F US  DL U13
11  F US  DL U13
12  F US  DL Z13
13  F US  DL Z13

colClasses needs to be specified, otherwise F gets converted to FALSE (which is something I need to fix in "splitstackshape", otherwise I would have recommended that :) )
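
To see why colClasses matters here, a minimal sketch on a two-row subset of the data (the names `auto` and `chr` are only for illustration and are not part of the original answer):

```r
text <- c("F.US.CLE.V13", "F.US.DL.Z13")

# Without colClasses, read.table guesses each column's type,
# and a column made up only of "F" is read as logical
auto <- read.table(text = text, sep = ".")
class(auto$V1)
auto$V1

# With colClasses = "character", every column stays a string
chr <- read.table(text = text, sep = ".", colClasses = "character")
class(chr$V1)
chr$V1
```

The first call should give a logical column of FALSE values, while the second keeps the original "F" strings.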

Update (> a year later)...

Alternatively, you can use my cSplit function, like this:

library(data.table)
library(splitstackshape)

cSplit(as.data.table(text), "text", ".")
#     text_1 text_2 text_3 text_4
#  1:      F     US    CLE    V13
#  2:      F     US    CA6    U13
#  3:      F     US    CA6    U13
#  4:      F     US    CA6    U13
#  5:      F     US    CA6    U13
#  6:      F     US    CA6    U13
#  7:      F     US    CA6    U13
#  8:      F     US    CA6    U13
#  9:      F     US     DL    U13
# 10:      F     US     DL    U13
# 11:      F     US     DL    U13
# 12:      F     US     DL    Z13
# 13:      F     US     DL    Z13

Or, separate from "tidyr", like this:

library(dplyr)
library(tidyr)

as.data.frame(text) %>% separate(text, into = paste("V", 1:4, sep = "_"))
#    V_1 V_2 V_3 V_4
# 1    F  US CLE V13
# 2    F  US CA6 U13
# 3    F  US CA6 U13
# 4    F  US CA6 U13
# 5    F  US CA6 U13
# 6    F  US CA6 U13
# 7    F  US CA6 U13
# 8    F  US CA6 U13
# 9    F  US  DL U13
# 10   F  US  DL U13
# 11   F  US  DL U13
# 12   F  US  DL Z13
# 13   F  US  DL Z13

edited Oct 12 '14 at 16:26

answered Sep 05 '13 at 17:10


+1 dammit Ananda. Make me feel dumb you do. :-) – Simon O'Hanlon Sep 05 '13 at 17:14
And by `splitstackshape` don't you mean `shapeshifter`? – Simon O'Hanlon Sep 05 '13 at 17:15
WOW. This was so simple. – Tim Sep 05 '13 at 17:19
`shapeshifter` or `shapeshiftR` is way cooler. – Tyler Rinker Sep 05 '13 at 17:54
And now the check mark is yours! – Simon O'Hanlon Sep 05 '13 at 18:32
@SimonO101 sorry. This post just became very detailed in the different ways this could be accomplished! Thank you for your help!!! – Tim Sep 05 '13 at 18:47
Updated a year later with more reasonable (and faster) alternatives. – A5C1D2H2I1M1N2O1R2T1 Oct 12 '14 at 16:27
The `cSplit` function is really awesome, thanks a lot. – bim Mar 10 '16 at 13:26

score 18 · Answer 2 · answered Sep 05 '13 at 17:01

Is this what you are trying to do?

# Our data
text <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
"F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
"F.US.DL.U13", "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.Z13", "F.US.DL.Z13"
)

#  Split into individual elements by the '.' character
#  Remember to escape it, because '.' by itself matches any single character
elems <- unlist( strsplit( text , "\\." ) )

#  We know the dataframe should have 4 columns, so make a matrix
m <- matrix( elems , ncol = 4 , byrow = TRUE )

#  Coerce to data.frame - head() is just to illustrate the top portion
head( as.data.frame( m ) )
#  V1 V2  V3  V4
#1  F US CLE V13
#2  F US CA6 U13
#3  F US CA6 U13
#4  F US CA6 U13
#5  F US CA6 U13
#6  F US CA6 U13
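
The comment about escaping '.' is worth a quick illustration. This is only a sketch on a single string; the variable names `wild`, `escaped` and `literal` are made up for the example:

```r
x <- "F.US.CA6.U13"

# Unescaped: "." is a regex wildcard, so every character counts as a
# separator and nothing useful is left between the separators
wild <- strsplit(x, ".")[[1]]

# Escaped: split on a literal dot
escaped <- strsplit(x, "\\.")[[1]]

# fixed = TRUE treats the pattern as a plain string, so no escaping is needed
literal <- strsplit(x, ".", fixed = TRUE)[[1]]

wild
escaped
literal
```

The unescaped call should return only empty strings, while the other two return the four pieces "F", "US", "CA6" and "U13".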

answered Sep 05 '13 at 17:01

Simon O'Hanlon


This was what I was looking for! And as always with R there are so many ways to get there. Thank you for your help. Just use the merge command to merge the new df back into my main df? – Tim Sep 05 '13 at 17:15
@Tim it's a bit hard to say without seeing the structure of your data.frame, but you could give a go and see if it works? Otherwise edit your question pasting in the output from `dput( head( df ) )` for a more reliable answer! :-) – Simon O'Hanlon Sep 05 '13 at 17:17
all I needed was `unlist()`. thanks – Amit Kohli Nov 19 '14 at 18:38
Thanks for commenting your code. The comment "Remember to escape it, because '.' by itself matches any single character" helped me fix my issue. Thanks a lot :) – Sriram Nov 30 '16 at 22:28

score 9 · Answer 3 · answered Sep 05 '13 at 17:18

The way via unlist and matrix seems a bit convoluted, and it requires you to hard-code the number of elements, which is actually a pretty big no-go. (Of course, you could avoid hard-coding that number by determining it at run time.)
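
As a rough sketch of what determining the number at run time could look like (this assumes every string splits into the same number of pieces; the names `parts` and `n_cols` are only illustrative):

```r
text <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.DL.Z13")

parts <- strsplit(text, "\\.")

# Count the pieces in each element; all rows must agree
n_cols <- unique(lengths(parts))
stopifnot(length(n_cols) == 1)

# Same unlist + matrix approach, but without a hard-coded ncol
m <- matrix(unlist(parts), ncol = n_cols, byrow = TRUE)
m
```

The stopifnot() call makes the sketch fail loudly on ragged input instead of silently misaligning the columns.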

I would go a different route and build the result directly from the list that strsplit returns. For me, this is conceptually simpler. There are essentially two ways of doing this:

Use as.data.frame: since the list is exactly the wrong way round (we have a list of rows rather than a list of columns), we have to transpose the result, which turns it into a character matrix. We also clear the rownames since they are ugly by default (but that’s strictly unnecessary!):
```
`rownames<-`(t(as.data.frame(strsplit(text, '\\.'))), NULL)
```
Alternatively, use rbind to bind the list of rows into a matrix. We use do.call to call rbind with all the rows as separate arguments:
```
do.call(rbind, strsplit(text, '\\.'))
```

Both ways yield the same result:

     [,1] [,2] [,3]  [,4]
[1,] "F"  "US" "CLE" "V13"
[2,] "F"  "US" "CA6" "U13"
[3,] "F"  "US" "CA6" "U13"
[4,] "F"  "US" "CA6" "U13"
[5,] "F"  "US" "CA6" "U13"
[6,] "F"  "US" "CA6" "U13"
…

Clearly, the second way is much simpler than the first.
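
Note that the printed result is a character matrix rather than a data frame. If a data frame is what you are after, one possible follow-up is sketched below; the column names are hypothetical, chosen only for the example:

```r
text <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.DL.Z13")

# Character matrix with one row per input string
parts <- do.call(rbind, strsplit(text, ".", fixed = TRUE))

# Convert to a data frame, keeping the columns as character
df <- as.data.frame(parts, stringsAsFactors = FALSE)

# Hypothetical column names, not taken from the question
names(df) <- c("type", "country", "site", "code")
df
```

This relies on every string having the same number of pieces; with ragged input, rbind recycles the shorter rows, so it is worth checking the lengths first.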

*and requires you to hard-code the number of elements (this is actually a pretty big no-go.* but the number of columns was given and specified in the OP so I don't really have a problem with it. Nice alternatives though. +1 — Simon O'Hanlon, Sep 05 '13 at 17:25
In fact I *really* like `do.call(rbind, strsplit(text, '\\.'))` — Simon O'Hanlon, Sep 05 '13 at 17:27

score 1 · Answer 4 · answered Sep 14 '19 at 00:51

We could use tidyr::extract()

x <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
  "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
  "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.Z13", "F.US.DL.Z13"
)


library(tidyr)
extract(tibble(data=x),"data", regex = "^(.*?)\\.(.*?)\\.(.*?)\\.(.*?)$",into = LETTERS[1:4])
#> # A tibble: 13 x 4
#>    A     B     C     D    
#>    <chr> <chr> <chr> <chr>
#>  1 F     US    CLE   V13  
#>  2 F     US    CA6   U13  
#>  3 F     US    CA6   U13  
#>  4 F     US    CA6   U13  
#>  5 F     US    CA6   U13  
#>  6 F     US    CA6   U13  
#>  7 F     US    CA6   U13  
#>  8 F     US    CA6   U13  
#>  9 F     US    DL    U13  
#> 10 F     US    DL    U13  
#> 11 F     US    DL    U13  
#> 12 F     US    DL    Z13  
#> 13 F     US    DL    Z13

Another option is to use unglue::unglue_data()

# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_data(x,"{A}.{B}.{C}.{D}")
#>    A  B   C   D
#> 1  F US CLE V13
#> 2  F US CA6 U13
#> 3  F US CA6 U13
#> 4  F US CA6 U13
#> 5  F US CA6 U13
#> 6  F US CA6 U13
#> 7  F US CA6 U13
#> 8  F US CA6 U13
#> 9  F US  DL U13
#> 10 F US  DL U13
#> 11 F US  DL U13
#> 12 F US  DL Z13
#> 13 F US  DL Z13

Created on 2019-09-14 by the reprex package (v0.3.0)
