I am trying to figure out how to use the tstrsplit() function from data.table to split a string at a given position. I am aware of Q1, Q2 & Q3, but these do not address my question.

As an example:

 library(data.table)
 DT2 <- data.table(a = paste0(LETTERS[1:5],seq(10,15)), b = runif(6))
 DT2
     a         b  
1: A10 0.4153622
2: B11 0.1567381
3: C12 0.5361883
4: D13 0.5920144
5: E14 0.3376648
6: A15 0.5503773

I tried the following which did not work:

DT2[, c("L", "D") := tstrsplit(a, "")][]
DT2[, c("L", "D") := tstrsplit(a, "[A-Z]")][]
DT2[, c("L", "D") := tstrsplit(a, "[0-9]{1}")][]

The expectation:

     a         b    L   D
1: A10 0.4153622    A   10
2: B11 0.1567381    B   11
3: C12 0.5361883    C   12
4: D13 0.5920144    D   13
5: E14 0.3376648    E   14
6: A15 0.5503773    A   15

Any help with an explanation is highly appreciated.

Psidom
Daniel

1 Answer

You can split on the regex "(?<=[A-Za-z])(?=[0-9])" to split between letters and digits. This pattern restricts the split to a position that is preceded by a letter and followed by a digit.

The regex has two parts: a lookbehind, (?<=[A-Za-z]), which means "after a letter", and a lookahead, (?=[0-9]), which means "before a digit" (see more about regex lookaround). In R, you need to specify perl=TRUE to use Perl-compatible regular expressions so that these work:

DT2[, c("L", "D") := tstrsplit(a, "(?<=[A-Za-z])(?=[0-9])", perl=TRUE)][]

#     a          b L  D
#1: A10 0.01487372 A 10
#2: B11 0.95035709 B 11
#3: C12 0.49230300 C 12
#4: D13 0.67183871 D 13
#5: E14 0.40076579 E 14
#6: A15 0.27871477 A 15
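
To see why the three attempts in the question fall short, it can help to run base R's strsplit(), which tstrsplit() builds on, against a single value. This is only a minimal sketch: the variable names x and parts are made up for illustration and are not part of the original answer.

```r
# One sample value from column a (illustrative)
x <- "A10"

# Empty pattern: splits into single characters, giving three pieces
strsplit(x, "")[[1]]            # "A" "1" "0"

# Letter as delimiter: the letter itself is consumed,
# leaving an empty first piece
strsplit(x, "[A-Z]")[[1]]       # ""  "10"

# Single digit as delimiter: the digits themselves are consumed
strsplit(x, "[0-9]{1}")[[1]]    # "A" ""

# Zero-width lookaround: the split point sits between "A" and "1",
# so no character is consumed
parts <- strsplit(x, "(?<=[A-Za-z])(?=[0-9])", perl = TRUE)[[1]]
parts                           # "A" "10"
```

In the first three calls the delimiter either matches everywhere or eats the characters that should be kept, which is why only the lookaround version yields the two expected columns.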
Psidom
  • Thanks for the answer. Would you please let me know why you used `?`, or in other words, what does `(?<=[A-Za-z])` mean? I know regex, but I do not know why you use `?`. Furthermore, what does `perl = TRUE` mean here, as it is not explained/defined in the package? – Daniel Jul 24 '17 at 21:06
  • When `?` is the first character in a regex group, it signals there will be extra options for the group. In this case, `(?<=[A-Za-z])` means "preceded by `[A-Za-z]`, but don't include this group in the match." Similarly, `(?=[0-9])` means "followed by a digit, but don't include this group in the match." – Nathan Werth Jul 25 '17 at 16:01
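
The "don't include this group in the match" behaviour described in the comment above can be checked in base R by inserting a marker at each match instead of splitting. The following is a rough sketch, not a definitive explanation: the names pattern, marked and res, and the "|" marker, are arbitrary choices for illustration. As far as I know, tstrsplit() forwards extra arguments such as perl = TRUE to strsplit(), and R's default regex engine does not accept lookbehind syntax, which is why the second call is wrapped in tryCatch().

```r
pattern <- "(?<=[A-Za-z])(?=[0-9])"

# The match is an empty position between the letter and the digit,
# so gsub() inserts the marker without replacing any character
marked <- gsub(pattern, "|", "A10", perl = TRUE)
marked  # "A|10"

# Without perl = TRUE the default engine is expected to reject the
# pattern; the handler turns that error into a plain string
res <- tryCatch(
  strsplit("A10", pattern),
  error = function(e) "error"
)
res
```

If the marker lands between the letter and the first digit and both characters survive, the lookarounds matched a position rather than text.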