split string without loss of characters

Question

I wish to split strings at a certain character sequence while retaining that sequence at the start of the second resulting string. I can get most of the way there, except that strsplit drops the characters I split on, which I guess is called the delimiter.

Is there a way to request that strsplit retain the delimiter? Or must I use a regular expression of some kind? Thank you for any advice. This seems like a very basic question. Sorry if it is a duplicate. I prefer to use base R.

Here is an example showing what I have so far:

my.table <- read.table(text = '
                                                            model npar     AICc 
 AA(~region+state+county+city)BB(~region+state+county+city)CC(~1)   17 11111.11
         AA(~region+state+county)BB(~region+state+county)CC(~123)   14 22222.22
                        AA(~region+state)BB(~region+state)CC(~33)   13 33333.33
                                  AA(~region)BB(~region)CC(~4321)    6 44444.44
', header = TRUE, stringsAsFactors = FALSE)

desired.result <- read.table(text = '
                                                      model        CC npar     AICc
 AA(~region+state+county+city)BB(~region+state+county+city)    CC(~1)   17 11111.11
           AA(~region+state+county)BB(~region+state+county)  CC(~123)   14 22222.22
                         AA(~region+state)BB(~region+state)   CC(~33)   13 33333.33
                                     AA(~region)BB(~region) CC(~4321)    6 44444.44
', header = TRUE, stringsAsFactors = FALSE)

split.model  <- strsplit(my.table$model, 'CC\\(')

split.models <- matrix(unlist(split.model), ncol=2, byrow=TRUE, dimnames = list(NULL, c("model", "CC")))

desires.result2 <- data.frame(split.models, my.table[,2:ncol(my.table)])
desires.result2

#                                                       model     CC npar     AICc
# 1 AA(~region+state+county+city)BB(~region+state+county+city)    ~1)   17 11111.11
# 2           AA(~region+state+county)BB(~region+state+county)  ~123)   14 22222.22
# 3                         AA(~region+state)BB(~region+state)   ~33)   13 33333.33
# 4                                     AA(~region)BB(~region) ~4321)    6 44444.44

score 9 · Accepted Answer · edited May 23 '17 at 11:49

The basic idea is to pass strsplit a regular expression built from look-around assertions, which match a position without consuming any characters. However, strsplit does not behave as you might expect with a positive lookahead alone, so the pattern below combines a lookbehind with a lookahead. See the excellent post by @JoshO'Brien for an explanation.

pattern <- "(?<=\\))(?=CC)"
strsplit(my.table$model, pattern, perl=TRUE)
# [[1]]
# [1] "AA(~region+state+county+city)BB(~region+state+county+city)"
# [2] "CC(~1)"                                                    

# [[2]]
# [1] "AA(~region+state+county)BB(~region+state+county)"
# [2] "CC(~123)"                                        

# [[3]]
# [1] "AA(~region+state)BB(~region+state)" "CC(~33)"                           

# [[4]]
# [1] "AA(~region)BB(~region)" "CC(~4321)"
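
To see why the lookbehind is there, here is a minimal sketch comparing the pattern above with a lookahead-only pattern on one of the model strings. The variable names `x`, `lookahead.only` and `both` are made up for illustration. As far as I understand strsplit's matching loop, a zero-length match at the very start of the remaining string makes it split off a single character, which is why the lookahead alone does not keep the delimiter intact.

```r
x <- "AA(~region)BB(~region)CC(~4321)"

# Lookahead only: after the first split the remaining string starts with "CC",
# so the pattern matches again at position 0 and strsplit peels off one character.
lookahead.only <- strsplit(x, "(?=CC)", perl = TRUE)[[1]]

# Lookbehind plus lookahead: the lookbehind requires a ")" before the split
# point, so there is no second match at the start of the remaining "CC(~4321)".
both <- strsplit(x, "(?<=\\))(?=CC)", perl = TRUE)[[1]]

print(lookahead.only)
print(both)
```

The lookbehind works here only because every `CC(` in the example data is preceded by a closing parenthesis; with differently shaped strings another anchor would be needed.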

Of course, I leave the task of combining the pieces with do.call(rbind, ...) and cbind to get the final desired.result to you.
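
For completeness, one possible way to finish the job is sketched below. It assumes every model string splits into exactly two pieces, rebuilds the example table with data.frame instead of read.table so that it runs on its own, and uses `pieces` and `final.result` as made-up names.

```r
# Rebuild the example table from the question so the sketch is self-contained.
my.table <- data.frame(
  model = c("AA(~region+state+county+city)BB(~region+state+county+city)CC(~1)",
            "AA(~region+state+county)BB(~region+state+county)CC(~123)",
            "AA(~region+state)BB(~region+state)CC(~33)",
            "AA(~region)BB(~region)CC(~4321)"),
  npar = c(17, 14, 13, 6),
  AICc = c(11111.11, 22222.22, 33333.33, 44444.44),
  stringsAsFactors = FALSE
)

pattern <- "(?<=\\))(?=CC)"
pieces  <- strsplit(my.table$model, pattern, perl = TRUE)

# Stack the two-element character vectors into a two-column matrix.
# Assumption: every element of 'pieces' has length 2.
split.models <- do.call(rbind, pieces)
colnames(split.models) <- c("model", "CC")

# Bind the remaining columns (everything except the original model) back on.
final.result <- cbind(
  data.frame(split.models, stringsAsFactors = FALSE),
  my.table[, -1, drop = FALSE]
)
final.result
```

If some strings might not contain the delimiter, the pieces would have unequal lengths and do.call(rbind, ...) would recycle values, so it is worth checking lengths(pieces) first.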

score 0 · Answer 2 · answered Jul 12 '13 at 20:18

Almost right after I posted, I thought of using gsub to insert a space in front of the delimiter and then splitting on the space. That said, I like Arun's answer better.

my.table <- read.table(text = '
                                                            model npar     AICc 
 AA(~region+state+county+city)BB(~region+state+county+city)CC(~1)   17 11111.11
         AA(~region+state+county)BB(~region+state+county)CC(~123)   14 22222.22
                        AA(~region+state)BB(~region+state)CC(~33)   13 33333.33
                                  AA(~region)BB(~region)CC(~4321)    6 44444.44
', header = TRUE, stringsAsFactors = FALSE)

my.table$model <- gsub("CC", " CC", my.table$model)

split.model <- strsplit(my.table$model, ' ')

split.models <- matrix(unlist(split.model), ncol=2, byrow=TRUE, dimnames = list(NULL, c("model", "CC")))

desires.result <- data.frame(split.models, my.table[,2:ncol(my.table)])
desires.result

#                                                        model        CC npar     AICc
# 1 AA(~region+state+county+city)BB(~region+state+county+city)    CC(~1)   17 11111.11
# 2           AA(~region+state+county)BB(~region+state+county)  CC(~123)   14 22222.22
# 3                         AA(~region+state)BB(~region+state)   CC(~33)   13 33333.33
# 4                                     AA(~region)BB(~region) CC(~4321)    6 44444.44

if you're gonna `sub`, then just do `sub('.*(CC.*)', '\\1', model)` and `sub('CC.*', '', model)` to get the two parts (assuming you have 2 parts) — eddi, Jul 12 '13 at 20:21
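
A minimal sketch of what that comment suggests, using two of the model strings. The names `model`, `cc.part` and `model.part` are made up for illustration, and it assumes "CC" occurs exactly once per string.

```r
model <- c("AA(~region)BB(~region)CC(~4321)",
           "AA(~region+state)BB(~region+state)CC(~33)")

# Keep only the part starting at "CC" (the greedy .* swallows everything before it).
cc.part <- sub(".*(CC.*)", "\\1", model)

# Drop everything from "CC" onwards to get the first part.
model.part <- sub("CC.*", "", model)

data.frame(model = model.part, CC = cc.part, stringsAsFactors = FALSE)
```

Note that the two calls disagree if "CC" appears more than once: the first keeps the text from the last "CC", while the second cuts at the first "CC".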

score 0 · Answer 3 · answered Jul 13 '13 at 05:22

... why not just tack the separator back on afterwards? Would seem to save a lot of trouble fiddling with regexes.

split.model <- lapply(strsplit(my.table$model, 'CC\\('), function(x) {
    x[2] <- paste0("CC(", x[2])
    x
})

Hong Ooi

Yes, but: 1) the limitation of this method shows up (not for this question, but in general) when one wants to split on, for example, CA, CB, CC, CD and CE, but *not* on CF, CG, ... 2) You're essentially looping over all the rows and pasting once again, which *may not* be efficient on larger data (not benchmarked yet). – Arun Jul 14 '13 at 11:59

@arun It was an answer to the specific question posed: how do you search for a particular string without getting rid of it. And unless you're dealing with a stupidly large number of cases (millions?) all of the solutions posed are basically instantaneous. Besides, as the saying goes, you have execution time and you have writing time. The time taken coming up with an appropriate regex would probably exceed the time taken to run it. – Hong Ooi Jul 14 '13 at 13:17

Actually, on a 40k row data.frame, this solution takes 0.8 seconds, whereas the regexp solution takes 0.065 seconds. Now, we can debate whether 0.8 seconds is a lot of time or not in a coding sense. I think I already mentioned that the limitation is in a "general" scenario. However, the trend I've observed on SO (at least under the R tag) is to provide a *general* and *efficient* solution where possible. Actually, my solution takes 72 characters to write, whereas yours will definitely take more. So what you ***really*** mean is ***thinking time***. I guess I have a different take on that. – Arun Jul 14 '13 at 13:42
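
To illustrate the "general" case raised in the comments (split before CA through CE, but not before CF, CG, ...), here is a rough sketch that swaps a character class into the look-around pattern. The input strings and the names `x` and `res` are invented for this example and are not from the question.

```r
# Hypothetical inputs: the first two should split, the third should not.
x <- c("AA(~a)CB(~1)", "AA(~a)CE(~2)", "AA(~a)CF(~3)")

# Split at the position after ")" and before "CA(" ... "CE(" only.
pattern <- "(?<=\\))(?=C[A-E]\\()"
res <- strsplit(x, pattern, perl = TRUE)
res
```

Strings without a matching delimiter come back as a single piece, so the results can have unequal lengths and need a little care before being bound into a matrix.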
