Extract numbers from strings including '|'

Question

I have data where some of the items are numbers separated by "|", like:

head(mintimes)
[1] "3121|3151" "1171"      "1351|1381" "1050"      ""          "122" 
head(minvalues)
[1] 14    10    11    31 Inf    22

What I would like to do is extract all the times and match them to the minvalues. To end up with something like:

times    values
3121     14
3151     14
1171     10
1351     11
1381     11
1050     31
122      22

I've tried to strsplit(mintimes, "|") and I've tried str_extract(mintimes, "[0-9]+") but they don't seem to work. Any ideas?

Rich Scriven · Accepted Answer · 2014-07-09T23:19:39.680

6

| is a regular expression metacharacter. When used literally, these special characters need to be escaped either with [] or with \\ (or you could use fixed = TRUE in some functions). So your call to strsplit() should be

strsplit(mintimes, "[|]")

or

strsplit(mintimes, "\\|")

or

strsplit(mintimes, "|", fixed = TRUE)

Regarding your other try with stringr functions, str_extract_all() seems to do the trick.

library(stringr)
str_extract_all(mintimes, "[0-9]+")

To get your desired result,

> mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122") 
> minvalues <- c(14, 10, 11, 31, Inf, 22)
> s <- strsplit(mintimes, "[|]")
> data.frame(times = as.numeric(unlist(s)), 
             values = rep(minvalues, sapply(s, length)))
#   times values
# 1  3121     14
# 2  3151     14
# 3  1171     10
# 4  1351     11
# 5  1381     11
# 6  1050     31
# 7   122     22

edited Jul 09 '14 at 23:19

answered Jun 17 '14 at 01:01

Rich Scriven

97,041
11
181
245

Ok mister quick draw. ;) – jbaums Jun 17 '14 at 01:02
I thought OP wanted not only to split on the vertical pipe but also match the values to the times. One approach might be to duplicate the times variable in times2 column, then write something like ifelse(grep("\\|"), do nothing (i.e., leave both times1 and times2 as is, else "" to delete the unneeded time2. Then, create a data frame in the format of his example. Such wizardry is manifestly beyond this newbie. – lawyeR Jun 17 '14 at 01:36
@user2583119, I've updated my answer to include the desired result. – Rich Scriven Jun 17 '14 at 01:54

score 4 · Answer 2 · answered Jun 17 '14 at 01:02

4

By default strsplit splits using a regular expression and "|" is a special character in the regular expression syntax. You can either escape it

strsplit(mintimes,"\\|")

or just set fixed=T to not use regular expressions

strsplit(mintimes,"|", fixed=T)

answered Jun 17 '14 at 01:02

MrFlick

195,160
17
277
295

score 3 · Answer 3 · answered Jun 17 '14 at 02:03

3

I have written a function called cSplit that is useful for these types of things. You can get it from my Gist: https://gist.github.com/mrdwab/11380733

Usage would be:

cSplit(data.table(mintimes, minvalues), "mintimes", "|", "long")
#    mintimes minvalues
# 1:     3121        14
# 2:     3151        14
# 3:     1171        10
# 4:     1351        11
# 5:     1381        11
# 6:     1050        31
# 7:      122        22

It also has a "wide" setting, in case that would be at all useful to you:

cSplit(data.table(mintimes, minvalues), "mintimes", "|", "wide")
#    minvalues mintimes_1 mintimes_2
# 1:        14       3121       3151
# 2:        10       1171         NA
# 3:        11       1351       1381
# 4:        31       1050         NA
# 5:       Inf         NA         NA
# 6:        22        122         NA

Note: The output is a data.table.

answered Jun 17 '14 at 02:03

A5C1D2H2I1M1N2O1R2T1

190,393
28
405
485

This could probably be updated to `library(splitstackshape)` now, right? – Rich Scriven Oct 19 '14 at 04:16
@RichardScriven, there are probably at least a couple of dozen answers on SO that have `cSplit` as part of the answer--not prepared to update them all (also because that bumps questions to the front of the "active" page and I wouldn't want that to happen). – A5C1D2H2I1M1N2O1R2T1 Oct 19 '14 at 08:45
@RichardScriven, as an alternative, I've added a `message` to the `cSplit` Gist that mentions that it's now part of "splitstackshape". That should help :-) – A5C1D2H2I1M1N2O1R2T1 Oct 19 '14 at 08:51

jbaums · Answer 4 · 2014-06-17T01:17:07.733

As others have mentioned, you need to escape the | to include it literally in a regular expression. As always, we can skin this cat many ways, and here's one way to do it with stringr:

x <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")

library(stringr)
unlist(str_extract_all(x, "\\d+"))

# [1] "3121" "3151" "1171" "1351" "1381" "1050" "122"

This won't work as expected if you have any decimal points in a character string of numbers, so the following (which says to match anything but |) might be safer:

unlist(str_extract_all(x, '[^|]+'))

# [1] "3121" "3151" "1171" "1351" "1381" "1050" "122"

Either way, you might want to wrap the result in as.numeric.

score 2 · Answer 5 · answered Oct 01 '14 at 08:52

And here's another solution using stri_split_fixed from the stringi package. As an added value, we also play with mapply and do.call.

Input data:

mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122") 
minvalues <- c(14, 10, 11, 31, Inf, 22)

Split mintimes w.r.t. | and convert to numeric:

library("stringi")    
mintimes <- lapply(stri_split_fixed(mintimes, "|"), as.numeric)
## [[1]]
## [1] 3121 3151
## 
## [[2]]
## [1] 1171
## 
## [[3]]
## [1] 1351 1381
## 
## [[4]]
## [1] 1050
## 
## [[5]]
## [1] NA
## 
## [[6]]
## [1] 122

Column-bind each minvalues with corresponding mintimes:

tmp <- mapply(cbind, mintimes, minvalues)
## [[1]]
##      [,1] [,2]
## [1,] 3121   14
## [2,] 3151   14
## 
## [[2]]
##      [,1] [,2]
## [1,] 1171   10
## 
## [[3]]
##      [,1] [,2]
## [1,] 1351   11
## [2,] 1381   11
## 
## [[4]]
##      [,1] [,2]
## [1,] 1050   31
## 
## [[5]]
##      [,1] [,2]
## [1,]   NA  Inf
## 
## [[6]]
##      [,1] [,2]
## [1,]  122   22

Row-bind all the 6 matrices & remove NA-rows:

res <- do.call(rbind, tmp)
res[!is.na(res[,1]),]
##      [,1] [,2]
## [1,] 3121   14
## [2,] 3151   14
## [3,] 1171   10
## [4,] 1351   11
## [5,] 1381   11
## [6,] 1050   31
## [7,]  122   22

score 1 · Answer 6 · edited May 23 '17 at 11:58

To get the output you want, try something like this:

library(dplyr)

Split.Times <- function(x) {
  mintimes <- as.numeric(unlist(strsplit(as.character(x$mintimes), "\\|")))
  return(data.frame(mintimes = mintimes, minvalues = x$minvalues, stringsAsFactors=FALSE))
}

df <- data.frame(mintimes, minvalues, stringsAsFactors=FALSE)

df %>%
  filter(mintimes != "") %>%
  group_by(mintimes) %>%
  do(Split.Times(.))

This produces:

  mintimes minvalues
1     1050        31
2     1171        10
3      122        22
4     1351        11
5     1381        11
6     3121        14
7     3151        14

(I borrowed from my answer here - which is pretty much the same question/problem)

score 1 · Answer 7 · answered Oct 19 '14 at 04:07

Here's a qdap package approach:

mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122") 
minvalues <- c(14, 10, 11, 31, Inf, 22)

library(qdap)
list2df(setNames(strsplit(mintimes, "\\|"), minvalues), "times", "values")

##   times values
## 1  3121     14
## 2  3151     14
## 3  1171     10
## 4  1351     11
## 5  1381     11
## 6  1050     31
## 7   122     22

score 0 · Answer 8 · answered Jun 17 '14 at 01:00

0

You can use [:punct:]

strsplit(mintimes, "[[:punct:]]")

answered Jun 17 '14 at 01:00

ThatGuy

1,225
10
28

I wouldn't recommend this in general. If you have any decimal points in a character vector of numbers, and use that pattern, you will break your data. – jbaums Jun 17 '14 at 01:03

Extract numbers from strings including '|'

8 Answers8