splitting only once with multiple instances of a delimiter using R or Pyspark

Question

I don't know if I can post a question for 2 separate programming languages in the same post. If not, I will be happy to post them as separate questions. A solution in either one of the languages (R or Pyspark) would suffice.

I am trying to split a column into 2 columns using the _ delimiter only on the first instance. However, there are values that have more than one instance of the delimiter. For example:

Input       Expected Output
BCT_12      [BCT] [12]
HDR_40_50   [HDR] [40_50]
500_ERP_70  [500] [ERP_70]

Currently, I am counting the number of _ and using a for loop to split all of them and concatenate the columns after the first instance. But it is quite clunky and prone to breaking. Would love to know if there is an elegant method to do this.

Does this answer your question? [Split data frame string column into multiple columns](https://stackoverflow.com/questions/4350440/split-data-frame-string-column-into-multiple-columns) — wp78de, Sep 08 '20 at 21:40

score 1 · Accepted Answer · answered Sep 08 '20 at 20:44

On R, it's very easy using stringr (part of the tidyverse):

xx <- c("BCT_12", "HDR_40_50", "500_ERP_70")
stringr::str_split_fixed(xx, "_", n=2)
#     [,1]  [,2]    
#[1,] "BCT" "12"    
#[2,] "HDR" "40_50" 
#[3,] "500" "ERP_70"

And if your data is already in a data.frame, you can use separate() from tidyr:

library(tidyverse)
tibble(value = xx) %>%
  separate(value, into = c("first", "rest"), "_", extra = "merge")
# A tibble: 3 x 2
#  first rest  
#  <chr> <chr> 
#1 BCT   12    
#2 HDR   40_50 
#3 500   ERP_70

splitting only once with multiple instances of a delimiter using R or Pyspark

1 Answers1