I have a data set of strings and want to extract the substring up to and including the first colon. Earlier I asked here how to extract just the portion after the first colon: "Split strings at the first colon". Below I list a few of my attempts at solving the current problem.

I know that ^[^:]+: matches the portion I want to keep, but I cannot figure out how to extract that portion.

Here is an example data set and the desired result.

my.data <- "here is: some text
here is some more.
even: more text
still more text
this text keeps: going."

my.data2 <- readLines(textConnection(my.data))

desired.result <- "here is:
0
even:
0
this text keeps:"

desired.result2 <- readLines(textConnection(desired.result))

# Here are some of my attempts

# discards lines 2 and 4 but does not extract the wanted portion from lines 1, 3, and 5
ifelse( my.data2 == gsub("^[^:]+:", "", my.data2), '', my.data2)

# returns the portion I do not want rather than the portion I do want
sub("^[^:]+:", "\\1", my.data2, perl=TRUE)

# returns an entire line if it contains a colon
grep("^[^:]+:", my.data2, value=TRUE)

# identifies which rows contain a match
regexpr("^[^:]+:", my.data2)

# my attempt at anchoring the right end instead of the left end
regexpr("[^:]+:$", my.data2)

An earlier question, "Regular Expression Opposite", concerns returning the opposite of a match. Starting from the solution to my own earlier question linked above, I have not figured out how to implement that approach in R.

I have recently obtained RegexBuddy to study regular expressions. That is how I know ^[^:]+: matches what I want. I just have not been able to use that information to extract the matches.

I am aware of the stringr package. Perhaps it can help, but I much prefer a solution in base R.

Thank you for any advice.

Mark Miller
  • I think you are just missing the capturing parentheses, `(` and `)`. Your expression including them would be `^([^:]+:)` – CBroe Mar 16 '13 at 21:24
  • I think what you are looking for is regex groups. Maybe this helps http://stackoverflow.com/questions/952275/regex-group-capture-in-r ? – ffledgling Mar 16 '13 at 21:24
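
To illustrate the point made in the comments, here is a minimal sketch (not part of the original thread) comparing the pattern with and without a capture group. The variable names `x`, `no_group`, and `with_group` are made up for this example, and the data are copied from the question. Without parentheses, `"\\1"` has nothing to refer to, so the matched prefix is simply deleted, which is why the attempt in the question returned the unwanted portion.

```r
# Example lines copied from the question.
x <- c("here is: some text", "here is some more.", "even: more text")

# No capture group: "\\1" refers to nothing, so the matched prefix is removed.
# perl = TRUE mirrors the attempt shown in the question.
no_group <- sub("^[^:]+:", "\\1", x, perl = TRUE)

# Capture group around the prefix, and ".*$" consumes the rest of the line,
# so the whole line is replaced by the captured prefix.
with_group <- sub("^([^:]+:).*$", "\\1", x)

print(no_group)
print(with_group)
```

Lines without a colon do not match at all and come back unchanged, so they still need separate handling.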

3 Answers

"I know that ^[^:]+: matches the portion I want to keep, but I cannot figure out how to extract that portion."

So just wrap parens around that, add ".+$" to the end so the rest of the line is consumed, and use sub with a back-reference:

sub("(^[^:]+:).+$", "\\1", vec)

 step1 <- sub("^([^:]+:).+$", "\\1", my.data2)
 step2 <- ifelse(grepl(":", step1), step1, 0)
 step2
#[1] "here is:"         "0"                "even:"            "0"               
#[5] "this text keeps:"

It wasn't clear whether you wanted those as separate vector elements or to have them pasted together with linefeeds:

> step3 <- paste0(step2, collapse="\n")
> step3
[1] "here is:\n0\neven:\n0\nthis text keeps:"
> cat(step3)
here is:
0
even:
0
this text keeps:
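
As a possible alternative to `sub` with a back-reference, base R's `regexpr` and `regmatches` can pull the matched text out directly. The following is only a sketch under the assumption that non-matching lines should become "0", as in the desired result; `x`, `m`, `has_match`, and `result` are names chosen for this example.

```r
# Example lines copied from the question.
x <- c("here is: some text", "here is some more.", "even: more text",
       "still more text", "this text keeps: going.")

# regexpr gives the start of the first match per element, or -1 for no match.
m <- regexpr("^[^:]+:", x)
has_match <- as.integer(m) != -1

# Start with "0" everywhere, then fill in the matched text where there is one.
# regmatches returns one string per matching element, in order.
result <- rep("0", length(x))
result[has_match] <- regmatches(x, m)

print(result)
```

This keeps the regular expression exactly as written in the question, with no capture group needed.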
IRTFM

This seems to produce what you're looking for (though it returns only the bits of lines that have a colon in them):

grep(":",gsub("(^[^:]+:).*$","\\1",my.data2 ),value=TRUE)
[1] "here is:"         "even:"            "this text keeps:"

As I was typing this I saw @DWin's answer, which also suggests parens and adds the ifelse that gives you the "0"s as well.

hrbrmstr

Another less elegant approach with strsplit:

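x <- strsplit(my.data2, ":")          # split each line at every colon
lens <- sapply(x, length)             # number of pieces; 1 means no colon in these data
y <- paste0(sapply(x, "[", 1), ":")   # first piece, with the colon put back
y[lens==1] <- "0"                     # lines without a colon become "0"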
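
A related sketch that avoids splitting altogether: locate the first colon with a fixed (non-regex) search and cut the string there with `substr`. This is an illustration rather than part of the original answer, and the names `x`, `pos`, and `first_part` are assumptions made for the example.

```r
# Example lines copied from the question.
x <- c("here is: some text", "here is some more.", "even: more text",
       "still more text", "this text keeps: going.")

# Position of the first literal colon in each line, or -1 if there is none.
# as.integer drops the extra attributes that regexpr attaches to its result.
pos <- as.integer(regexpr(":", x, fixed = TRUE))

# Keep everything up to and including the colon; use "0" when no colon exists.
first_part <- ifelse(pos > 0, substr(x, 1, pos), "0")

print(first_part)
```

Because the cut position includes the colon itself, nothing has to be pasted back afterwards.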
Tyler Rinker